In the last few videos, we talked about how to do forward-propagation and back-propagation in a neural network in order to compute derivatives. But back prop as an algorithm has a lot of details and can be a little bit tricky to implement. And one unfortunate property is that there are many ways to have subtle bugs in back prop, so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it's working. Your cost function J of theta may end up decreasing on every iteration of gradient descent, but this can happen even though there is some bug in your implementation of back prop. So it looks like J of theta is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation, and you might just not know that there was this subtle bug giving you this worse performance. So what can we do about this? There's an idea called gradient checking that eliminates almost all of these problems. So today, every time I implement back-propagation or a similar gradient descent algorithm on a neural network, or any other reasonably complex model, I always implement gradient checking. And if you do this, it will help you make sure, and gain high confidence, that your implementation of forward prop and back prop is 100% correct. In what I've seen, this pretty much eliminates all the problems associated with a buggy implementation of back prop.
In the previous videos, I asked you to take on faith that the formulas I gave for computing the deltas and the D terms actually do compute the gradients of the cost function. Once you implement numerical gradient checking, which is the topic of this video, you'll be able to verify for yourself that the code you're writing is indeed computing the derivative of the cost function J. So here's the idea. Consider the following example. Suppose I have the function J of theta, and I have some value theta, and for this example I'm going to assume that theta is just a real number. Let's say I want to estimate the derivative of this function at this point. The derivative is equal to the slope of the tangent line at that point. Here's how I'm going to numerically approximate the derivative, or rather, here's a procedure for numerically approximating the derivative. I'm going to compute theta plus epsilon, a value a little bit to the right, and theta minus epsilon, a value a little bit to the left. Then I'm going to take those two points and connect them by a straight line, and use the slope of that little red line as my approximation to the derivative, whereas the true derivative is the slope of the blue line over there. So it seems like it would be a pretty good approximation. Mathematically, the slope of this red line is the vertical height divided by the horizontal width. The point on top is J of theta plus epsilon, and the point below is J of theta minus epsilon. So the vertical difference is J of theta plus epsilon minus J of theta minus epsilon, and the horizontal distance is just 2 epsilon.
So my approximation is going to be that the derivative of J with respect to theta, at this value of theta, is approximately J of theta plus epsilon, minus J of theta minus epsilon, divided by 2 epsilon. Usually I use a pretty small value for epsilon, maybe on the order of 10 to the minus 4. There's usually a large range of different values for epsilon that work just fine. In fact, if you let epsilon become really small, then mathematically this term becomes exactly the derivative, the slope of the function at this point. It's just that we don't want to use an epsilon that's too small, because then you might run into numerical problems. So I usually use an epsilon of around 10 to the minus 4, say. By the way, some of you may have seen an alternative formula for estimating the derivative: the formula on the right is called the one-sided difference, whereas the formula on the left is called the two-sided difference. The two-sided difference gives us a slightly more accurate estimate, so I usually use that rather than the one-sided difference estimate. So, concretely, what you implement in Octave is the following: you compute gradApprox, which is going to be our approximation to the derivative, as just this formula: J of theta plus epsilon, minus J of theta minus epsilon, divided by 2 times epsilon. And this will give you a numerical estimate of the gradient at that point. In this example it seems like it's a pretty good estimate. Now, on the previous slide we considered the case where theta was a real number. Let's look at the more general case, where theta is a vector of parameters.
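To make that concrete, here is a minimal sketch in Octave for the single-number case, assuming theta is a scalar and J is a function handle that returns the cost for a given theta (the handle J is an assumption for illustration; the names epsilon and gradApprox follow the video):

    epsilon = 1e-4;                                                          % small perturbation
    gradApprox = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon);  % two-sided difference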
So let's say theta is in R n, and it might be the unrolled version of the parameters of our neural network. So theta is a vector that has n elements, theta 1 up to theta n. We can then use a similar idea to approximate all of the partial derivative terms. Concretely, the partial derivative of the cost function with respect to the first parameter, theta 1, can be obtained by taking J and increasing theta 1: so you have J of theta 1 plus epsilon, theta 2, and so on, minus J of theta 1 minus epsilon, theta 2, and so on, divided by 2 epsilon. The partial derivative with respect to the second parameter, theta 2, is again the same thing, except that here you're increasing theta 2 by epsilon, and here you're decreasing theta 2 by epsilon. And so on, down to the partial derivative with respect to theta n, which you'd get by increasing and decreasing theta n by epsilon. So these equations give you a way to numerically approximate the partial derivative of J with respect to any one of your parameters. Concretely, what you implement is therefore the following. We implement the following in Octave to numerically compute the derivatives. We say for i equals 1 through n, where n is the dimension of our parameter vector theta, and I usually do this with the unrolled version of the parameters, so theta is just a long list of all of the parameters in my neural network. I'm going to set thetaPlus equal to theta, and then increase the ith element of thetaPlus by epsilon. So thetaPlus is equal to theta, except for its ith element, which is now incremented by epsilon.
So thetaPlus is equal to theta 1, theta 2, and so on, with theta i having epsilon added to it, and then on down to theta n. That's what thetaPlus is. Similarly, these two lines set thetaMinus to something similar, except that instead of theta i plus epsilon, it now has theta i minus epsilon. And then finally, you compute gradApprox of i, and this will give you your approximation to the partial derivative with respect to theta i of J of theta. The way we use this in our neural network implementation is that we would run this for loop to compute the partial derivative of the cost function with respect to every parameter in our network. We can then compare against the gradient that we got from back prop. So DVec was the vector of derivatives we got from back prop; back-propagation is a relatively efficient way to compute the derivatives, or the partial derivatives, of the cost function with respect to all of our parameters. What I usually do is then take my numerically computed derivative, that is, the gradApprox we just computed up here, and make sure that it is equal, or approximately equal, up to small amounts of numerical round-off, to the DVec that I got from back prop. If these two ways of computing the derivative give me the same answer, or at least very similar answers, say up to a few decimal places, then I'm much more confident that my implementation of back prop is correct.
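As a sketch of that for loop in Octave, assuming theta is the unrolled parameter vector and J(theta) is a function that returns the cost for that vector; the names thetaPlus, thetaMinus, EPSILON, and gradApprox follow the video, while the length call and the zeros preallocation are my additions:

    EPSILON = 1e-4;
    n = length(theta);
    gradApprox = zeros(n, 1);
    for i = 1:n
      thetaPlus = theta;
      thetaPlus(i) = thetaPlus(i) + EPSILON;     % nudge only the ith parameter up
      thetaMinus = theta;
      thetaMinus(i) = thetaMinus(i) - EPSILON;   % nudge only the ith parameter down
      gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
    end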
And when I plug these DVec vectors into gradient descent or some advanced optimization algorithm, I can then be much more confident that I'm computing the derivatives correctly, and therefore that hopefully my code will run correctly and do a good job optimizing J of theta. Finally, I want to put everything together and tell you how to implement this numerical gradient checking. Here's what I usually do. The first thing I do is implement back-propagation to compute DVec; this is the procedure we talked about in an earlier video, and DVec may be the unrolled version of these matrices. Then I implement numerical gradient checking to compute gradApprox; this is what I described earlier in this video, on the previous slide. Then you should make sure that DVec and gradApprox give similar values, say up to a few decimal places. And finally, and this is the important step: once you start to use your code for learning, for seriously training your network, it is important to turn off gradient checking and to no longer compute this gradApprox thing using the numerical derivative formulas that we talked about earlier in this video. The reason for that is that the numerical gradient checking code, the stuff we talked about in this video, is very computationally expensive; it's a very slow way to approximate the derivative. Whereas in contrast, the back-propagation algorithm that we talked about earlier, the thing for computing D1, D2, D3, or DVec, is a much more computationally efficient way of computing the derivatives. So once you've verified that your implementation of back-propagation is correct, you should turn off gradient checking and just stop using it.
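As one way to make the "similar values" check concrete, here is a small sketch; this particular relative-difference measure is my own addition rather than something from the video, and it assumes DVec and gradApprox are column vectors of the same length:

    % Relative difference between the two gradient estimates;
    % the closer this is to zero, the better the agreement.
    relDiff = norm(DVec - gradApprox) / (norm(DVec) + norm(gradApprox));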
So just to reiterate, you should be sure to disable your gradient checking code before running your algorithm for many iterations of gradient descent, or for many iterations of the advanced optimization algorithms, in order to train your classifier. Concretely, if you were to run numerical gradient checking on every single iteration of gradient descent, or in the inner loop of your cost function, then your code would be very slow, because the numerical gradient checking code is much slower than the back-propagation algorithm, the method where, remember, we were computing delta 4, delta 3, delta 2, and so on. The back-propagation algorithm is a much faster way to compute derivatives than gradient checking. So once you've verified that your implementation of back-propagation is correct, make sure you turn off, or disable, your gradient checking code while you train your algorithm, or else your code could run very slowly. So that's how you take gradients numerically, and that's how you can verify that your implementation of back-propagation is correct. Whenever I implement back-propagation or a similar gradient descent algorithm for a complicated model, I always use gradient checking. It really helps me make sure that my code is correct.