In the previous video, we talked about a cost function for the neural network. In this video, let's start to talk about an algorithm for trying to minimize that cost function. In particular, we'll talk about the back propagation algorithm.

Here's the cost function that we wrote down in the previous video. What we'd like to do is find parameters Theta that minimize J(Theta). In order to use either gradient descent or one of the advanced optimization algorithms, what we need to do is write code that takes as input the parameters Theta and computes J(Theta) and these partial derivative terms. Remember that the parameters in the neural network are these terms, Theta superscript (l) subscript ij; each of those is a real number, and so these are the partial derivative terms we need to compute. In order to compute the cost function J(Theta), we just use this formula up here, so what I want to do for most of this video is focus on how we can compute these partial derivative terms.

Let's start by talking about the case where we have only one training example. So imagine, if you will, that our entire training set comprises only one training example, which is a pair (x, y). I'm not going to write (x(1), y(1)); I'll just write this one training example as (x, y). Let's step through the sequence of calculations we would do with this one training example.

The first thing we do is apply forward propagation in order to compute what the hypothesis actually outputs given the input. Concretely, a(1) is the vector of activation values of the first layer, that is, the input.
So I'm going to set that to x, and then we're going to compute z(2) = Theta(1) a(1) and a(2) = g(z(2)), where g is the sigmoid activation function applied to z(2). This gives us the activations for the first hidden layer, that is, for layer two of the network, and we also add the bias term. Next we apply two more steps of forward propagation to compute a(3) and a(4), which is also the output of our hypothesis h(x). So this is our vectorized implementation of forward propagation, and it allows us to compute the activation values for all of the neurons in our neural network.

Next, in order to compute the derivatives, we're going to use an algorithm called back propagation. The intuition behind the back propagation algorithm is that for each node we're going to compute a term delta superscript (l) subscript j, which is going to somehow represent the error of node j in layer l. Recall that a superscript (l) subscript j denotes the activation of the j-th unit in layer l, and so this delta term is in some sense going to capture the error in the activation of that neuron, a measure of how much we might wish the activation of that node were slightly different. Concretely, take the example neural network we have on the right, which has four layers, so capital L is equal to 4. For each output unit, we're going to compute this delta term. So delta for the j-th unit in the fourth layer is equal to just the activation of that unit minus the corresponding actual value in our training example. And this term here, the activation of the j-th output unit, can also be written h_Theta(x) subscript j.
So this delta term is just the difference between what our hypothesis outputs and what the value of y was in our training set, where y subscript j is the j-th element of the vector-valued y in our labeled training set.

And by the way, if you think of delta, a, and y as vectors, then you can also come up with a vectorized implementation of this, which is just delta(4) = a(4) minus y, where each of delta(4), a(4), and y is a vector whose dimension is equal to the number of output units in our network.

So we've now computed the error term delta(4) for our network. What we do next is compute the delta terms for the earlier layers in our network. Here's the formula for computing delta(3): delta(3) is equal to Theta(3) transpose times delta(4), dot-times g prime of z(3). This dot-times is the element-wise multiplication operation that we know from MATLAB. So Theta(3) transpose times delta(4) is a vector, g prime of z(3) is also a vector, and dot-times is the element-wise multiplication between these two vectors.

This term g prime of z(3) formally is the derivative of the activation function g evaluated at the input values given by z(3). If you know calculus, you can try to work it out yourself and see that you can simplify it to the same answer that I get. But I'll just tell you pragmatically what that means: what you do to compute this g prime, these derivative terms, is just a(3) dot-times (1 minus a(3)), where a(3) is the vector of activation values for that layer and 1 is the vector of ones.
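To make these steps concrete, here is a minimal sketch of the forward pass and the two delta computations described so far, written in Python/NumPy rather than the course's Octave/MATLAB. The function and variable names (sigmoid, Theta1, Theta2, Theta3) and the convention of prepending a bias unit to each activation vector are illustrative assumptions of this sketch, not code from the lecture. Note that `*` between NumPy arrays is exactly the element-wise product the lecture writes as dot-times.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Theta1, Theta2, Theta3):
    """Forward propagation for a 4-layer network (L = 4), one training example.
    Each Theta matrix has shape (units in next layer, 1 + units in this layer)."""
    a1 = np.concatenate(([1.0], x))              # input layer plus bias unit
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))    # layer-2 activations plus bias
    z3 = Theta2 @ a2
    a3 = np.concatenate(([1.0], sigmoid(z3)))    # layer-3 activations plus bias
    z4 = Theta3 @ a3
    a4 = sigmoid(z4)                             # output layer = h(x)
    return a1, a2, a3, a4

def output_and_layer3_deltas(x, y, Theta1, Theta2, Theta3):
    """Error terms for one example: delta(4) = a(4) - y, and
    delta(3) = Theta(3)' * delta(4) .* g'(z(3)), with g'(z(3)) = a(3).*(1 - a(3))."""
    a1, a2, a3, a4 = forward_propagate(x, Theta1, Theta2, Theta3)
    delta4 = a4 - y                                   # output-layer error
    delta3 = (Theta3.T @ delta4) * (a3 * (1.0 - a3))  # element-wise product
    delta3 = delta3[1:]      # drop the bias-unit entry (one common convention)
    return delta4, delta3
```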
Next you apply a similar formula to compute delta(2), where again that can be computed using a similar formula, only now it is a(2), like so. I won't prove it here, but it is possible to prove, if you know calculus, that this expression is mathematically equal to the derivative of the activation function g, which I'm denoting by g prime. And finally, that's it; there is no delta(1) term, because the first layer corresponds to the input layer, and those are just the features we observed in our training set, so they don't have any error associated with them. It's not that we want to try to change those values. And so we have delta terms only for layers 2, 3, and 4 for this example.

The name back propagation comes from the fact that we start by computing the delta term for the output layer, then we go back a layer and compute the delta terms for the hidden layer before it, layer 3, and then we go back another step to compute delta(2). So we're sort of back-propagating the errors from the output layer to layer 3 to layer 2, and hence the name back propagation.

Finally, the derivation is surprisingly complicated and surprisingly involved, but if you just do these few steps of computation, it is possible to prove, via a frankly somewhat complicated mathematical proof, that if you ignore regularization, then the partial derivative terms you want are exactly given by the activations and these delta terms. This is ignoring lambda, or alternatively, setting the regularization term lambda equal to 0.
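Written out as equations (keeping the lecture's MATLAB-style .* for element-wise multiplication), the hidden-layer errors and the claim about the derivatives are, for this single training example:

$$\delta^{(3)} = \big(\Theta^{(3)}\big)^T \delta^{(4)} \;.\!*\; g'\big(z^{(3)}\big), \qquad \delta^{(2)} = \big(\Theta^{(2)}\big)^T \delta^{(3)} \;.\!*\; g'\big(z^{(2)}\big), \qquad g'\big(z^{(l)}\big) = a^{(l)} \;.\!*\; \big(1 - a^{(l)}\big)$$

$$\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta) \;=\; a^{(l)}_j \,\delta^{(l+1)}_i \qquad (\text{ignoring regularization, i.e. } \lambda = 0)$$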
We'll fix this detail about the regularization term later, but by performing back propagation and computing these delta terms, you can pretty quickly compute these partial derivative terms for all of your parameters. So this is a lot of detail. Let's take everything and put it all together to talk about how to implement back propagation to compute derivatives with respect to your parameters, for the case when we have a large training set, not just a training set of one example. Here's what we do.

Suppose we have a training set of m examples, like the one shown here. The first thing we're going to do is set these Delta superscript (l) subscript ij. So this triangular symbol? That's actually the capital Greek letter delta; the symbol we had on the previous slide was the lowercase delta, so the triangle is capital Delta. We're going to set this equal to zero for all values of l, i, and j. Eventually, this capital Delta(l) subscript ij will be used to compute the partial derivative of J(Theta) with respect to Theta(l) subscript ij. As we'll see in a second, these Deltas are going to be used as accumulators that will slowly add things up in order to compute these partial derivatives.

Next, we're going to loop through our training set. So we'll say for i equals 1 through m, and for the i-th iteration, we're going to be working with the training example (x(i), y(i)).
So the first thing we're going to do is set a(1), which is the activations of the input layer, equal to x(i), the input for our i-th training example, and then we're going to perform forward propagation to compute the activations for layer two, layer three, and so on, up to the final layer, layer capital L. Next, we're going to use the output label y(i) from the specific example we're looking at to compute the error term delta(L) for the output layer. So delta(L) is what the hypothesis outputs minus what the target label was. And then we're going to use the back propagation algorithm to compute delta(L minus 1), delta(L minus 2), and so on down to delta(2); once again, there is no delta(1), because we don't associate an error term with the input layer.

And finally, we're going to use these capital Delta terms to accumulate the partial derivative terms that we wrote down on the previous line. By the way, if you look at this expression, it's possible to vectorize this too. Concretely, if you think of Delta(l) subscript ij as a matrix, indexed by the subscripts i and j, then we can rewrite this as Delta(l) gets updated as Delta(l) plus lowercase delta(l+1) times a(l) transpose. So that's a vectorized implementation of this step that automatically does the update for all values of i and j. Finally, after executing the body of the for loop, we go outside the for loop and compute the following: we compute capital D as follows, and we have two separate cases, for j equals zero and for j not equal to zero.
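Putting the two updates just described into equation form: inside the loop, the accumulation is

$$\Delta^{(l)}_{ij} := \Delta^{(l)}_{ij} + a^{(l)}_j\,\delta^{(l+1)}_i, \qquad \text{or, vectorized,} \qquad \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \big(a^{(l)}\big)^T,$$

and after the loop the two cases for the D terms take the following form (the exact scaling of the regularization term isn't spelled out in this transcript; one common convention divides it by m as well):

$$D^{(l)}_{ij} := \frac{1}{m}\,\Delta^{(l)}_{ij} + \frac{\lambda}{m}\,\Theta^{(l)}_{ij} \quad \text{if } j \neq 0, \qquad\qquad D^{(l)}_{ij} := \frac{1}{m}\,\Delta^{(l)}_{ij} \quad \text{if } j = 0.$$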
The case of j equals zero corresponds to the bias term, so when j equals zero, that's why the extra regularization term is missing. Finally, while the formal proof is pretty complicated, what you can show is that once you've computed these D terms, they are exactly the partial derivatives of the cost function with respect to each of your parameters, and so you can use them in either gradient descent or in one of the advanced optimization algorithms.

So that's the back propagation algorithm and how you compute the derivatives of your cost function for a neural network. I know this looked like a lot of details and a lot of steps strung together. But both in the programming assignment write-up and later in this video, we'll give you a summary of this, so we can have all the pieces of the algorithm together, so that you know exactly what you need to implement if you want to implement back propagation to compute the derivatives of your neural network's cost function with respect to those parameters.
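As a preview of that summary, here is one way all of the pieces could be strung together, again as an illustrative Python/NumPy sketch built on the hypothetical forward_propagate helper from earlier (a 4-layer network, sigmoid activations, bias units prepended to each activation vector, labels stored as vectors, and the regularization term scaled by lambda over m), not the course's own implementation:

```python
import numpy as np

def backprop_gradients(X, Y, Theta1, Theta2, Theta3, lam):
    """Return D1, D2, D3: gradients of the regularized cost with respect to
    Theta1, Theta2, Theta3, accumulated over all m training examples.
    X holds one example per row; Y holds the corresponding label vectors."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)     # capital-Delta accumulators, one per layer
    Delta2 = np.zeros_like(Theta2)
    Delta3 = np.zeros_like(Theta3)

    for i in range(m):
        # Forward propagation for example i (see the forward_propagate sketch above)
        a1, a2, a3, a4 = forward_propagate(X[i], Theta1, Theta2, Theta3)

        # Back-propagate the error terms: delta(4), then delta(3), then delta(2)
        delta4 = a4 - Y[i]
        delta3 = ((Theta3.T @ delta4) * (a3 * (1.0 - a3)))[1:]   # drop bias entry
        delta2 = ((Theta2.T @ delta3) * (a2 * (1.0 - a2)))[1:]   # drop bias entry

        # Accumulate: Delta(l) := Delta(l) + delta(l+1) * a(l)'
        Delta1 += np.outer(delta2, a1)
        Delta2 += np.outer(delta3, a2)
        Delta3 += np.outer(delta4, a3)

    # D terms: average over m, and regularize every column except j = 0 (bias)
    D1, D2, D3 = Delta1 / m, Delta2 / m, Delta3 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    D3[:, 1:] += (lam / m) * Theta3[:, 1:]
    return D1, D2, D3
```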