In the last few videos, we talked about how to do forward propagation and back propagation in a neural network in order to compute derivatives. But back prop as an algorithm has a lot of details and can be a little bit tricky to implement. And one unfortunate property is that there are many ways to have subtle bugs in back prop, so that if you run it with gradient descent or some other optimization algorithm, it could actually look like it's working. Your cost function J of theta may end up decreasing on every iteration of gradient descent, even though there's a bug in your implementation of back prop. So it looks like J of theta is decreasing, but you might just wind up with a neural network that has a higher level of error than you would with a bug-free implementation, and you might never know that this subtle bug is what's hurting your performance.

So what can we do about this? There's an idea called gradient checking that eliminates almost all of these problems. Today, every time I implement back propagation or a similar gradient descent algorithm on a neural network, or any other reasonably complex model, I always implement gradient checking. If you do this, it will help you make sure, and gain high confidence, that your implementation of forward prop and back prop is 100% correct. In my experience, this eliminates pretty much all the problems associated with a buggy implementation of back prop. In the previous videos, I asked you to take it on faith that the formulas I gave for computing the deltas and the D's actually do compute the gradients of the cost function. But once you implement numerical gradient checking, which is the topic of this video, you'll be able to verify for yourself that the code you're writing is indeed computing the derivative of the cost function J.

So here's the idea. Consider the following example. Suppose I have the function J of theta, and I have some value theta, and for this example I'm going to assume that theta is just a real number. Let's say I want to estimate the derivative of this function at this point. The derivative is equal to the slope of the tangent line there. Here's how I'm going to numerically approximate the derivative, or rather, here's a procedure for numerically approximating it. I'm going to compute theta plus epsilon, a value a little bit to the right, and theta minus epsilon, a value a little bit to the left. I'm going to connect those two points by a straight line, and use the slope of that little red line as my approximation to the derivative, where the true derivative is the slope of the blue line over there. It seems like it would be a pretty good approximation.

Mathematically, the slope of this red line is the vertical height divided by the horizontal width. The point on top is J of theta plus epsilon, and the point at the bottom is J of theta minus epsilon. So the vertical difference is J of theta plus epsilon, minus J of theta minus epsilon, and the horizontal distance is just 2 epsilon. So my approximation is that the derivative of J with respect to theta, at this value of theta, is approximately J of theta plus epsilon, minus J of theta minus epsilon, divided by 2 epsilon.
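Concretely, here's a minimal Octave sketch of that two-sided difference for a scalar theta. The cube function is just an illustrative stand-in here, not a cost function from the lecture:

    % Two-sided difference for a scalar theta.
    % J is an arbitrary stand-in cost function for illustration.
    J = @(theta) theta^3;
    theta = 1.0;
    EPSILON = 1e-4;
    gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2 * EPSILON)
    % The true derivative of theta^3 at theta = 1 is 3;
    % gradApprox comes out to about 3.00000001.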
Usually, I use a pretty small value for epsilon, maybe on the order of 10 to the minus 4. There's usually a large range of different values for epsilon that work just fine. In fact, if you let epsilon become really small, then mathematically this term becomes exactly the derivative, the slope of the function at this point. It's just that we don't want to use an epsilon that's too, too small, because then you might run into numerical problems. So I usually use epsilon around 10 to the minus 4, say.

By the way, some of you may have seen an alternative formula for estimating the derivative, J of theta plus epsilon minus J of theta, divided by epsilon. That one is called the one-sided difference, whereas the formula we've been using is called a two-sided difference. The two-sided difference gives us a slightly more accurate estimate, so I usually use that rather than the one-sided difference estimate.

So, concretely, what you implement in Octave is the following. You compute gradApprox, which is going to be our approximation to the derivative, as just this formula: J of theta plus epsilon, minus J of theta minus epsilon, divided by 2 times epsilon. And this will give you a numerical estimate of the gradient at that point. In this example, it seems like a pretty good estimate.

Now, on the previous slide we considered the case where theta was a real number. Let's look at the more general case where theta is a vector of parameters. So let's say theta is in R n; it might be the unrolled version of the parameters of our neural network. So theta is a vector with n elements, theta 1 up to theta n. We can then use a similar idea to approximate all of the partial derivative terms. Concretely, the partial derivative of the cost function with respect to the first parameter, theta 1, can be obtained by taking J of theta 1 plus epsilon (with the other parameters unchanged), minus J of theta 1 minus epsilon, divided by 2 epsilon. The partial derivative with respect to the second parameter, theta 2, is again this same expression, except that here you're increasing and decreasing theta 2 by epsilon. And so on, down to the partial derivative with respect to theta n, which you get by increasing and decreasing theta n by epsilon. So these equations give you a way to numerically approximate the partial derivative of J with respect to any one of your parameters theta i.

Concretely, here's what you implement in Octave to numerically compute the derivatives. We say for i equals 1 through n, where n is the dimension of our parameter vector theta. I usually do this with the unrolled version of the parameters, so theta is just a long vector of all of the parameters in my neural network. I'm going to set thetaPlus equal to theta, and then increase the ith element of thetaPlus by epsilon. So thetaPlus is equal to theta, except that the ith element is incremented by epsilon: it's theta 1, theta 2, and so on, with theta i plus epsilon, down through theta n. And similarly, two more lines set thetaMinus to something similar, except that instead of theta i plus epsilon, this now becomes theta i minus epsilon.
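Written out in Octave, that loop looks something like the following sketch. It assumes J is a function handle that maps the unrolled parameter vector to the scalar cost; the handle and the name nnCostFunction in the comment are illustrative assumptions, not something defined in this video:

    % Numerical gradient approximation for a vector theta.
    % Assumes J is a function handle from the unrolled parameter
    % vector to the scalar cost, e.g. J = @(t) nnCostFunction(t)
    % for some (hypothetical) cost function of your network.
    EPSILON = 1e-4;
    n = length(theta);
    gradApprox = zeros(n, 1);
    for i = 1:n
      thetaPlus = theta;
      thetaPlus(i) = thetaPlus(i) + EPSILON;    % bump ith element up
      thetaMinus = theta;
      thetaMinus(i) = thetaMinus(i) - EPSILON;  % bump ith element down
      gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
    end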
And then, finally, you compute gradApprox of i, as in the last line of that loop, and this gives you your approximation to the partial derivative of J of theta with respect to theta i.

The way we use this in our neural network implementation is that we run this FOR loop to compute the partial derivative of the cost function with respect to every parameter in our network. We can then compare against the gradient that we got from back prop. So DVec was the vector of derivatives we got from back prop; back-propagation is a relatively efficient way to compute the partial derivatives of the cost function with respect to all of our parameters. What I usually do is take my numerically computed derivative, this gradApprox that we just computed, and make sure that it is equal, or approximately equal up to small numerical round-off, to the DVec that I got from back prop. If these two ways of computing the derivative give me the same answer, or at least very similar answers up to a few decimal places, then I'm much more confident that my implementation of back prop is correct. And when I plug these DVec vectors into gradient descent or some advanced optimization algorithm, I can be much more confident that I'm computing the derivatives correctly, and therefore that my code will run correctly and do a good job optimizing J of theta.

Finally, I want to put everything together and tell you how to implement this numerical gradient checking. Here's what I usually do. The first thing I do is implement back-propagation to compute DVec; this is the procedure we talked about in an earlier video, and DVec may be the unrolled version of the matrices D1, D2, D3. Then I implement numerical gradient checking to compute gradApprox, which is what I described earlier in this video, on the previous slide. Then you should make sure that DVec and gradApprox give similar values, say up to a few decimal places. And finally, and this is the important step: before you start to use your code for learning, for seriously training your network, it is important to turn off gradient checking, and to no longer compute gradApprox using the numerical derivative formulas we talked about earlier in this video. The reason is that the numerical gradient checking code we talked about in this video is a very computationally expensive, very slow way to approximate the derivative. In contrast, the back-propagation algorithm we talked about earlier, the thing for computing D1, D2, D3, or DVec, is a much more computationally efficient way of computing the derivatives. So once you've verified that your implementation of back-propagation is correct, you should turn off gradient checking and stop using it. Just to reiterate: be sure to disable your gradient checking code before running your algorithm for many iterations of gradient descent, or for many iterations of the advanced optimization algorithms, in order to train your classifier. Concretely, if you were to run numerical gradient checking on every single iteration of gradient descent, or inside the inner loop of your cost function, your code would be very slow.
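As one concrete way to do that "similar values" check, here's a short Octave sketch; the relative-difference formula and the 10 to the minus 9 threshold are rules of thumb I'm suggesting here, not something specified in the video:

    % Compare back prop's DVec against the numerical gradApprox.
    % The relative-difference formula and the 1e-9 threshold are
    % suggested rules of thumb, not from the lecture itself.
    relDiff = norm(DVec - gradApprox) / norm(DVec + gradApprox);
    if relDiff < 1e-9
      disp('Back prop gradient looks correct.');
    else
      disp('Warning: back prop and numerical gradients disagree.');
    end

Dividing by the norm of the sum keeps the comparison scale-independent, so the same threshold works whether the gradient entries happen to be large or small.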
Because the numerical gradient checking code is much slower than the back-propagation algorithm, where, you remember, we were computing delta 4, delta 3, delta 2, and so on. Back-propagation is a much faster way to compute derivatives than gradient checking. So once you've verified that your implementation of back-propagation is correct, make sure you turn off, or disable, your gradient checking code while you train your algorithm, or else your code could run very slowly.

So that's how you compute gradients numerically, and that's how you can verify that your implementation of back-propagation is correct. Whenever I implement back-propagation or a similar gradient descent algorithm for a complicated model, I always use gradient checking. It really helps me make sure that my code is correct.