In the last video you learned
about gradient checking. In this video, I want to share
with you some practical tips or some notes on how to actually go about
implementing this for your neural network. First, don't use grad check in training,
only to debug. So what I mean is that,
computing d theta approx i, for all the values of i,
this is a very slow computation. So to implement gradient descent,
you'd use backprop to compute d theta and just use backprop to
compute the derivative. And it's only when you're debugging
that you would compute this to make sure it's close to d theta. But once you've done that, then you
would turn off the grad check, and don't run this during every
iteration of gradient descent, because that's just much too slow. Second, if an algorithm fails grad check,
look at the components, look at the individual components,
and try to identify the bug. So what I mean by that is if d theta
approx is very far from d theta, what I would do is look at the different
values of i to see which are the values of d theta approx that are really very
different than the values of d theta. So for example, if you find that
the values of theta or d theta, they're very far off, all correspond
to dbl for some layer or for some layers, but the components for
dw are quite close, right? Remember, different components of theta
correspond to different components of b and w. When you find this is the case,
then maybe you find that the bug is in how you're computing db, the derivative
with respect to parameters b. And similarly, vice versa, if you find
that the values that are very far, the values from d theta approx
that are very far from d theta, you find all those components came
from dw or from dw in a certain layer, then that might help you hone
in on the location of the bug. This doesn't always let you
identify the bug right away, but sometimes it helps you give you some
guesses about where to track down the bug. Next, when doing grad check, remember your regularization term
if you're using regularization. So if your cost function is J of
theta equals 1 over m sum of your losses and
then plus this regularization term. And sum over l of wl squared,
then this is the definition of J. And you should have that d
theta is gradient of J with respect to theta,
including this regularization term. So just remember to include that term. Next, grad check doesn't work with
dropout, because in every iteration, dropout is randomly eliminating
different subsets of the hidden units. There isn't an easy to compute
cost function J that dropout is doing gradient descent on. It turns out that dropout can be viewed
as optimizing some cost function J, but it's cost function J defined by
summing over all exponentially large subsets of nodes they could
eliminate in any iteration. So the cost function J is very
difficult to compute, and you're just sampling the cost function every time you eliminate different
random subsets in those we use dropout. So it's difficult to use grad
check to double check your computation with dropouts. So what I usually do is implement
grad check without dropout. So if you want, you can set keep-prob and
dropout to be equal to 1.0. And then turn on dropout and hope that my
implementation of dropout was correct. There are some other things you could do,
like fix the pattern of nodes dropped and verify that grad check for that
pattern of [INAUDIBLE] is correct, but in practice I don't usually do that. So my recommendation is turn off dropout,
use grad check to double check that your algorithm is at least correct without
dropout, and then turn on dropout. Finally, this is a subtlety. It is not impossible, rarely happens,
but it's not impossible that your implementation of gradient descent is
correct when w and b are close to 0, so at random initialization. But that as you run gradient descent and
w and b become bigger, maybe your implementation of backprop is
correct only when w and b is close to 0, but it gets more inaccurate when w and
b become large. So one thing you could do,
I don't do this very often, but one thing you could do is run grad
check at random initialization and then train the network for
a while so that w and b have some time to wander away from 0,
from your small random initial values. And then run grad check again after you've
trained for some number of iterations. So that's it for gradient checking. And congratulations for coming to
the end of this week's materials. In this week, you've learned about how to
set up your train, dev, and test sets, how to analyze bias and variance and what
things to do if you have high bias versus high variance versus maybe high bias and
high variance. You also saw how to apply
different forms of regularization, like L2 regularization and
dropout on your neural network. So some tricks for speeding up
the training of your neural network. And then finally, gradient checking. So I think you've seen
a lot in this week and you get to exercise a lot of these ideas
in this week's programming exercise. So best of luck with that, and I look forward to seeing you
in the week two materials.