In the previous videos, we put together almost all the pieces you need in order to implement and train a neural network. There's just one last idea I need to share with you, which is the idea of random initialization.

When you're running an algorithm like gradient descent, or also the advanced optimization algorithms, we need to pick some initial value for the parameters theta. The advanced optimization algorithms assume that you will pass them some initial value for the parameters theta. Now let's consider gradient descent. For that, we also need to initialize theta to something, and then we can slowly take steps downhill, using gradient descent, to minimize the function J of theta. So what do we set the initial value of theta to? Is it possible to set the initial value of theta to the vector of all zeros? Whereas this worked okay when we were using logistic regression, initializing all of your parameters to zero actually does not work when you're training a neural network.

Consider training the following neural network, and let's say we initialize all of the parameters in the network to zero. If you do that, then at initialization this weight that I'm coloring in blue is going to be equal to that weight, so they're both zero. And this weight that I'm coloring in red is equal to that weight, which I'm also coloring in red. And this weight, which I'm coloring in green, is going to be equal to the value of that weight. What that means is that both of your hidden units, a1 and a2, are going to be computing the same function of your inputs, and thus, for every one of your training examples, you end up with a(2)1 equal to a(2)2.
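To make this concrete, here is a minimal Octave sketch. It is my own illustration rather than code from the lecture, and it assumes a tiny network with two inputs, two hidden units, and a sigmoid activation; with all the weights set to zero, the two hidden units produce identical activations for any input.

% Hypothetical example: two inputs, two hidden units, all-zero weights
Theta1 = zeros(2, 3);        % weights from the input layer (including the bias unit) into the two hidden units
x  = [1; 0.5; -1.2];         % one training example, with the bias unit x0 = 1 prepended
z2 = Theta1 * x;             % both rows of Theta1 are identical, so z2(1) == z2(2)
a2 = 1 ./ (1 + exp(-z2));    % sigmoid activations: a2(1) == a2(2), no matter what the input is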
Moreover, and I'm not going to show this in too much detail, but because these outgoing weights are the same, you can also show that the delta values are going to be the same. So concretely, you end up with delta(2)1 equal to delta(2)2. And if you work through the math further, what you can show is that the partial derivatives with respect to your parameters will satisfy the following: writing out the derivatives of the cost function with respect to these two blue weights in the neural network, you'll find that those two partial derivatives are going to be equal to each other.

And so what this means is that even after, say, one gradient descent update, you're going to update this first blue weight with the learning rate times this, and you're going to update the second blue weight with the learning rate times this. But even after one gradient descent update, those two blue weights, those two blue-colored parameters, will end up the same as each other. They'll be some non-zero value now, but this value will be equal to that value. And similarly, even after one gradient descent update, this value will be equal to that value; there will be some non-zero values, it's just that the two red values will be equal to each other. And similarly, the two green weights will both change values, but they'll both end up with the same value as each other. So after each update, the parameters corresponding to the inputs going into each of the two hidden units are identical.
That's just saying that the two green weights are still the same, the two red weights are still the same, and the two blue weights are still the same, and what that means is that even after one iteration of, say, gradient descent, you find that your two hidden units are still computing exactly the same function of the input; so you still have a(2)1 equal to a(2)2, and you're back to this case. And as you keep running gradient descent, the two blue weights will stay the same as each other, the two red weights will stay the same as each other, and the two green weights will stay the same as each other. What this means is that your neural network really can't compute very interesting functions. Imagine that you had not only two hidden units, but many, many hidden units. Then what this is saying is that all of your hidden units are computing the exact same feature; all of your hidden units are computing the exact same function of the input. And this is a highly redundant representation, because it means that your final logistic regression unit really only gets to see one feature, since all of these are the same, and this prevents your neural network from learning something interesting.
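To see this numerically, here is a small Octave sketch, again my own illustration rather than the lecture's code, assuming a 2-2-1 sigmoid network trained on a single made-up example with the cross-entropy cost. It runs a few gradient descent steps starting from all-zero weights; the two rows of Theta1, which are the weights going into the two hidden units, receive identical updates at every iteration, so the symmetry is never broken.

sigmoid = @(z) 1 ./ (1 + exp(-z));
Theta1 = zeros(2, 3);   Theta2 = zeros(1, 3);    % symmetric (all-zero) initialization
x = [0.4; -0.7];  y = 1;  alpha = 1;             % one hypothetical training example and learning rate
for iter = 1:10
  a1 = [1; x];                                   % input with bias unit
  z2 = Theta1 * a1;   a2 = [1; sigmoid(z2)];     % hidden layer with bias unit
  a3 = sigmoid(Theta2 * a2);                     % output unit
  delta3 = a3 - y;                               % output-layer error
  delta2 = (Theta2(:, 2:end)' * delta3) .* sigmoid(z2) .* (1 - sigmoid(z2));
  Theta1 = Theta1 - alpha * (delta2 * a1');      % rows 1 and 2 get the same update
  Theta2 = Theta2 - alpha * (delta3 * a2');
end
disp(Theta1)                                     % the two rows are still equal to each other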
In order to get around this problem, the way we initialize the parameters of a neural network, therefore, is with random initialization. Concretely, the problem we saw on the previous slide is sometimes called the problem of symmetric weights, that is, the weights all being the same, and this random initialization is how we perform symmetry breaking. What we do is initialize each value of theta to a random number between minus epsilon and epsilon; that's the notation meaning a number between minus epsilon and plus epsilon.

So my weights, my parameters, are all going to be randomly initialized between minus epsilon and plus epsilon. The way I write code to do this in Octave is to set Theta1 equal to this. This rand(10, 11) is how you compute a random 10 by 11 dimensional matrix, where all of the values are between 0 and 1; these are real numbers that take on any continuous value between 0 and 1. And so, if you take a number between 0 and 1, multiply it by 2 times epsilon, and subtract epsilon, then you end up with a number that's between minus epsilon and plus epsilon. Incidentally, this epsilon here has nothing to do with the epsilon that we were using when we were doing gradient checking; when we were doing numerical gradient checking, there we were adding some value of epsilon to theta, whereas this is an unrelated value of epsilon, which is why I'm calling it epsilon, just to distinguish it from the value we were using in gradient checking. Similarly, if you want to initialize Theta2 to a random 1 by 11 matrix, you can do so using this piece of code here.
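In code, the commands just described would look something like the following. This is a sketch of what the lecture describes rather than the exact slide; the matrix sizes 10 by 11 and 1 by 11 are the ones mentioned above, and INIT_EPSILON, along with its value 0.12, is just an example name and choice for the initialization epsilon.

INIT_EPSILON = 0.12;                                         % some small value close to zero (example choice)
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % 10 x 11 matrix with entries in (-epsilon, +epsilon)
Theta2 = rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;    % 1 x 11 matrix with entries in (-epsilon, +epsilon)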
So, to summarize: to train a neural network, what you should do is randomly initialize the weights to small values close to zero, between minus epsilon and plus epsilon, say; then implement back-propagation, do gradient checking, and use either gradient descent or one of the advanced optimization algorithms to try to minimize J of theta as a function of the parameters theta, starting from randomly chosen initial values for the parameters. And by doing symmetry breaking, which is this process, hopefully gradient descent or the advanced optimization algorithms will be able to find a good value of theta.