In the last video, you learned about gradient checking. In this video, I want to share with you some practical tips, or some notes, on how to actually go about implementing this for your neural network.

First, don't use grad check in training, only to debug. What I mean is that computing d theta approx of i for all the values of i is a very slow computation. So to implement gradient descent, you'd use backprop to compute d theta, and just use backprop to compute the derivative. It's only when you're debugging that you would compute d theta approx to make sure it's close to d theta. But once you've done that, you would turn off grad check and not run it during every iteration of gradient descent, because that's just much too slow.

Second, if an algorithm fails grad check, look at the individual components and try to identify the bug. What I mean by that is, if d theta approx is very far from d theta, I would look at the different values of i to see which values of d theta approx are really very different from the corresponding values of d theta. Remember, different components of theta correspond to different components of b and W. So, for example, if you find that the components that are very far off all correspond to db for some layer or layers, but the components for dW are quite close, then maybe the bug is in how you're computing db, the derivative with respect to the parameters b. And similarly, vice versa, if you find that the components of d theta approx that are very far from d theta all came from dW, or from dW in a certain layer, that might help you hone in on the location of the bug. This doesn't always let you identify the bug right away, but sometimes it gives you some guesses about where to track it down.

Next, when doing grad check, remember your regularization term if you're using regularization. So if your cost function is J of theta equals 1 over m times the sum of your losses, plus the regularization term, lambda over 2m times the sum over l of the squared Frobenius norm of W for layer l, then this is the definition of J.
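For concreteness, here is a minimal numpy sketch of that debugging workflow; it is an illustration under assumptions, not code from the course. The parameter and gradient dictionary layout (`W1`, `b1`, `dW1`, ...) and the `cost_fn` / `unregularized_cost` helpers are hypothetical names chosen for this example.

```python
import numpy as np

def grad_check(parameters, gradients, cost_fn, epsilon=1e-7):
    """Compare backprop gradients to two-sided numerical estimates, per parameter.

    parameters: dict such as {"W1": ..., "b1": ..., "W2": ..., "b2": ...}
    gradients:  dict with matching keys "dW1", "db1", ... produced by backprop
    cost_fn:    maps the parameters dict to the scalar cost J (it must include
                the L2 regularization term if you are using one)
    """
    worst = []
    for name, value in parameters.items():
        grad_approx = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + epsilon       # J(theta_i + eps)
            j_plus = cost_fn(parameters)
            value[idx] = original - epsilon       # J(theta_i - eps)
            j_minus = cost_fn(parameters)
            value[idx] = original                 # restore theta_i
            grad_approx[idx] = (j_plus - j_minus) / (2.0 * epsilon)

        grad = gradients["d" + name]
        diff = (np.linalg.norm(grad - grad_approx)
                / (np.linalg.norm(grad) + np.linalg.norm(grad_approx) + 1e-12))
        print(f"{name}: relative difference = {diff:.2e}")
        worst.append((diff, name))

    # The parameter whose components disagree most (for example, b of one layer
    # while all the W's check out) is where to start looking for the bug.
    print("largest discrepancy:", max(worst)[1])


def cost_with_l2(parameters, unregularized_cost, lambd, m):
    """J = (1/m) * sum of losses + (lambda / (2m)) * sum over l of ||W[l]||_F^2."""
    l2 = sum(np.sum(np.square(v)) for k, v in parameters.items() if k.startswith("W"))
    return unregularized_cost(parameters) + (lambd / (2.0 * m)) * l2
```

The per-parameter relative differences are what let you localize the bug to db or dW of a particular layer, and passing a cost function built like `cost_with_l2` keeps the regularization term in the check.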
And you should have that d theta is the gradient of J with respect to theta, including this regularization term. So just remember to include that term.

Next, grad check doesn't work with dropout, because in every iteration dropout is randomly eliminating different subsets of the hidden units. There isn't an easy-to-compute cost function J that dropout is doing gradient descent on. It turns out that dropout can be viewed as optimizing some cost function J, but it's a cost function J defined by summing over the exponentially many subsets of nodes that dropout could eliminate in any iteration. So that cost function J is very difficult to compute, and you're just sampling it every time you eliminate a different random subset of nodes when you use dropout. So it's difficult to use grad check to double-check your computation with dropout. What I usually do is implement grad check without dropout. So, if you want, you can set keep_prob in dropout equal to 1.0, run grad check, and then turn on dropout and hope that your implementation of dropout was correct. There are some other things you could do, like fixing the pattern of nodes that get dropped and verifying that grad check for that fixed pattern of dropped nodes is correct, but in practice I don't usually do that. So my recommendation is: turn off dropout, use grad check to double-check that your algorithm is at least correct without dropout, and then turn on dropout.

Finally, and this is a subtlety, it rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, that is, at random initialization, but that as you run gradient descent and w and b become bigger, your implementation of backprop becomes inaccurate. That is, maybe your backprop is correct only when w and b are close to 0, and it gets less accurate as w and b grow large. So one thing you could do, and I don't do this very often, is run grad check at random initialization, then train the network for a while so that w and b have some time to wander away from 0, away from your small random initial values, and then run grad check again after you've trained for some number of iterations.
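To tie the dropout and re-checking advice together, here is a hypothetical outline of when you might run the check. `back_prop`, `forward_cost`, `update_parameters`, `X`, `Y`, `num_iterations`, and `learning_rate` are assumed placeholders, and `grad_check` is the sketch shown earlier; the point is only the ordering of the steps.

```python
def run_grad_check(X, Y, parameters):
    # Always check with dropout disabled (keep_prob = 1.0) so that J is well defined.
    grads = back_prop(X, Y, parameters, keep_prob=1.0)
    grad_check(parameters, grads,
               cost_fn=lambda p: forward_cost(X, Y, p, keep_prob=1.0))

run_grad_check(X, Y, parameters)          # 1) at random initialization

for i in range(num_iterations):           # 2) train with dropout turned back on
    grads = back_prop(X, Y, parameters, keep_prob=0.8)
    parameters = update_parameters(parameters, grads, learning_rate)

run_grad_check(X, Y, parameters)          # 3) again, after w and b have moved away from 0
```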
So that's it for gradient checking, and congratulations on coming to the end of this week's materials.

This week, you've learned how to set up your train, dev, and test sets, how to analyze bias and variance, and what to do if you have high bias, high variance, or maybe both high bias and high variance. You also saw how to apply different forms of regularization, like L2 regularization and dropout, to your neural network, as well as some tricks for speeding up its training. And then finally, gradient checking. So you've seen a lot this week, and you get to exercise a lot of these ideas in this week's programming exercise. Best of luck with that, and I look forward to seeing you in the week two materials.