In the last few videos, we talked about how to do forward-propagation and back-propagation in a neural network in order to compute derivatives. But back prop as an algorithm has a lot of details and can be a little bit tricky to implement. And one unfortunate property is that there are many ways to have subtle bugs in back prop, so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it's working. Your cost function J of theta may end up decreasing on every iteration of gradient descent, but this can happen even though there is some bug in your implementation of back prop. So it looks like J of theta is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation, and you might just not know that there was this subtle bug giving you this worse performance. So what can we do about this? There's an idea called gradient checking that eliminates almost all of these problems. So today, every time I implement back-propagation or a similar gradient descent algorithm on a neural network, or any other reasonably complex model, I always implement gradient checking. And if you do this, it will help you make sure, and gain high confidence, that your implementation of forward prop and back prop is 100% correct. In what I've seen, this pretty much eliminates all the problems associated with a buggy implementation of back prop.
In the previous videos, I asked you to take on faith that the formulas I gave for computing the deltas and the D terms actually do compute the gradients of the cost function. Once you implement numerical gradient checking, which is the topic of this video, you'll be able to verify for yourself that the code you're writing is indeed computing the derivative of the cost function J. So here's the idea. Consider the following example. Suppose I have the function J of theta, and I have some value theta, and for this example I'm going to assume that theta is just a real number. Let's say I want to estimate the derivative of this function at this point. The derivative is equal to the slope of the tangent line at that point. Here's how I'm going to numerically approximate the derivative, or rather, here's a procedure for numerically approximating the derivative. I'm going to compute theta plus epsilon, a value a little bit to the right, and theta minus epsilon, a value a little bit to the left. Then I'm going to take those two points and connect them by a straight line, and use the slope of that little red line as my approximation to the derivative, whereas the true derivative is the slope of the blue line over there. So it seems like it would be a pretty good approximation. Mathematically, the slope of this red line is the vertical height divided by the horizontal width. The point on top is J of theta plus epsilon, and the point below is J of theta minus epsilon. So the vertical difference is J of theta plus epsilon minus J of theta minus epsilon, and the horizontal distance is just 2 epsilon.
So my approximation is going to be that the derivative of J with respect to theta, at this value of theta, is approximately J of theta plus epsilon, minus J of theta minus epsilon, divided by 2 epsilon. Usually I use a pretty small value for epsilon, maybe on the order of 10 to the minus 4. There's usually a large range of different values for epsilon that work just fine. In fact, if you let epsilon become really small, then mathematically this term becomes exactly the derivative, the slope of the function at this point. It's just that we don't want to use an epsilon that's too small, because then you might run into numerical problems. So I usually use an epsilon of around 10 to the minus 4, say. By the way, some of you may have seen an alternative formula for estimating the derivative: the formula on the right is called the one-sided difference, whereas the formula on the left is called the two-sided difference. The two-sided difference gives us a slightly more accurate estimate, so I usually use that rather than the one-sided difference estimate. So, concretely, what you implement in Octave is the following: you compute gradApprox, which is going to be our approximation to the derivative, as just this formula: J of theta plus epsilon, minus J of theta minus epsilon, divided by 2 times epsilon. And this will give you a numerical estimate of the gradient at that point. In this example it seems like it's a pretty good estimate. Now, on the previous slide we considered the case where theta was a real number. Let's look at the more general case, where theta is a vector of parameters.
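To make that concrete, here is a minimal sketch in Octave for the single-number case, assuming theta is a scalar and J is a function handle that returns the cost for a given theta (the handle J is an assumption for illustration; the names epsilon and gradApprox follow the video):

    epsilon = 1e-4;                                                          % small perturbation
    gradApprox = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon);  % two-sided difference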
So let's say theta is in R n, and it might be the unrolled version of the parameters of our neural network. So theta is a vector that has n elements, theta 1 up to theta n. We can then use a similar idea to approximate all of the partial derivative terms. Concretely, the partial derivative of the cost function with respect to the first parameter, theta 1, can be obtained by taking J and increasing theta 1: so you have J of theta 1 plus epsilon, theta 2, and so on, minus J of theta 1 minus epsilon, theta 2, and so on, divided by 2 epsilon. The partial derivative with respect to the second parameter, theta 2, is again the same thing, except that here you're increasing theta 2 by epsilon, and here you're decreasing theta 2 by epsilon. And so on, down to the partial derivative with respect to theta n, which you'd get by increasing and decreasing theta n by epsilon. So these equations give you a way to numerically approximate the partial derivative of J with respect to any one of your parameters. Concretely, what you implement is therefore the following. We implement the following in Octave to numerically compute the derivatives. We say for i equals 1 through n, where n is the dimension of our parameter vector theta, and I usually do this with the unrolled version of the parameters, so theta is just a long list of all of the parameters in my neural network. I'm going to set thetaPlus equal to theta, and then increase the ith element of thetaPlus by epsilon. So thetaPlus is equal to theta, except for its ith element, which is now incremented by epsilon.
So thetaPlus is equal to theta 1, theta 2, and so on, with theta i having epsilon added to it, and then on down to theta n. That's what thetaPlus is. Similarly, these two lines set thetaMinus to something similar, except that instead of theta i plus epsilon, it now has theta i minus epsilon. And then finally, you compute gradApprox of i, and this will give you your approximation to the partial derivative with respect to theta i of J of theta. The way we use this in our neural network implementation is that we would run this for loop to compute the partial derivative of the cost function with respect to every parameter in our network. We can then compare against the gradient that we got from back prop. So DVec was the vector of derivatives we got from back prop; back-propagation is a relatively efficient way to compute the derivatives, or the partial derivatives, of the cost function with respect to all of our parameters. What I usually do is then take my numerically computed derivative, that is, the gradApprox we just computed up here, and make sure that it is equal, or approximately equal, up to small amounts of numerical round-off, to the DVec that I got from back prop. If these two ways of computing the derivative give me the same answer, or at least very similar answers, say up to a few decimal places, then I'm much more confident that my implementation of back prop is correct.
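As a sketch of that for loop in Octave, assuming theta is the unrolled parameter vector and J(theta) is a function that returns the cost for that vector; the names thetaPlus, thetaMinus, EPSILON, and gradApprox follow the video, while the length call and the zeros preallocation are my additions:

    EPSILON = 1e-4;
    n = length(theta);
    gradApprox = zeros(n, 1);
    for i = 1:n
      thetaPlus = theta;
      thetaPlus(i) = thetaPlus(i) + EPSILON;     % nudge only the ith parameter up
      thetaMinus = theta;
      thetaMinus(i) = thetaMinus(i) - EPSILON;   % nudge only the ith parameter down
      gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
    end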
And when I plug these DVec vectors into gradient descent or some advanced optimization algorithm, I can then be much more confident that I'm computing the derivatives correctly, and therefore that hopefully my code will run correctly and do a good job optimizing J of theta. Finally, I want to put everything together and tell you how to implement this numerical gradient checking. Here's what I usually do. The first thing I do is implement back-propagation to compute DVec; this is the procedure we talked about in an earlier video, and DVec may be the unrolled version of these matrices. Then I implement numerical gradient checking to compute gradApprox; this is what I described earlier in this video, on the previous slide. Then you should make sure that DVec and gradApprox give similar values, say up to a few decimal places. And finally, and this is the important step: once you start to use your code for learning, for seriously training your network, it is important to turn off gradient checking and to no longer compute this gradApprox thing using the numerical derivative formulas that we talked about earlier in this video. The reason for that is that the numerical gradient checking code, the stuff we talked about in this video, is very computationally expensive; it's a very slow way to approximate the derivative. Whereas in contrast, the back-propagation algorithm that we talked about earlier, the thing for computing D1, D2, D3, or DVec, is a much more computationally efficient way of computing the derivatives. So once you've verified that your implementation of back-propagation is correct, you should turn off gradient checking and just stop using it.
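As one way to make the "similar values" check concrete, here is a small sketch; this particular relative-difference measure is my own addition rather than something from the video, and it assumes DVec and gradApprox are column vectors of the same length:

    % Relative difference between the two gradient estimates;
    % the closer this is to zero, the better the agreement.
    relDiff = norm(DVec - gradApprox) / (norm(DVec) + norm(gradApprox));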
So just to reiterate, you should be sure to disable your gradient checking code before running your algorithm for many iterations of gradient descent, or for many iterations of the advanced optimization algorithms, in order to train your classifier. Concretely, if you were to run numerical gradient checking on every single iteration of gradient descent, or in the inner loop of your cost function, then your code would be very slow, because the numerical gradient checking code is much slower than the back-propagation algorithm, the method where, remember, we were computing delta 4, delta 3, delta 2, and so on. The back-propagation algorithm is a much faster way to compute derivatives than gradient checking. So once you've verified that your implementation of back-propagation is correct, make sure you turn off, or disable, your gradient checking code while you train your algorithm, or else your code could run very slowly. So that's how you take gradients numerically, and that's how you can verify that your implementation of back-propagation is correct. Whenever I implement back-propagation or a similar gradient descent algorithm for a complicated model, I always use gradient checking. It really helps me make sure that my code is correct.