Gradient checking is a technique that's helped me save tons of time, and helped me find bugs in my implementations of back propagation many times. Let's see how you can use it too to debug, or to verify that your implementation of backprop is correct.

So your neural network will have some set of parameters, W[1], b[1], and so on up to W[L], b[L]. To implement gradient checking, the first thing you should do is take all your parameters and reshape them into a giant vector theta. So what you should do is take W, which is a matrix, and reshape it into a vector. You take all of these Ws and reshape them into vectors, and then concatenate all of these things, so that you have a giant vector called theta. So instead of the cost function J being a function of the Ws and bs, you now have the cost function J being just a function of theta.

Next, with W and b ordered the same way, you can also take dW[1], db[1], and so on, and concatenate them into a big, giant vector d theta of the same dimension as theta. So same as before, you reshape dW[1], which is a matrix; db[1] is already a vector. You reshape dW[L], all of the dWs, which are matrices. Remember, dW[1] has the same dimension as W[1], and db[1] has the same dimension as b[1]. So with the same sort of reshaping and concatenation operation, you can reshape all of these derivatives into a giant vector d theta, which has the same dimension as theta. So the question is now: is d theta the gradient, or the slope, of the cost function J?

So here's how you implement gradient checking, often abbreviated to grad check. First, remember that J is now a function of the giant parameter vector theta. So it expands to J being a function of theta 1, theta 2, theta 3, and so on, up to whatever the dimension of this giant parameter vector theta is. To implement grad check, what you're going to do is implement a loop, so that for each i, that is, for each component of theta, you compute d theta approx i using a two-sided difference. So you take J of theta 1, theta 2, up to theta i, and you nudge theta i by adding epsilon to it. So you just increase theta i by epsilon, and keep everything else the same.
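As a concrete illustration of this reshaping step, here is a minimal NumPy sketch. It assumes the parameters (and, with the same ordering, the gradients) live in a Python dictionary keyed by names like "W1", "b1" or "dW1", "db1"; that layout and the function name are assumptions for the example, not something specified in the lecture.

```python
import numpy as np

def params_to_vector(params, num_layers, prefix_w="W", prefix_b="b"):
    """Reshape W1, b1, ..., WL, bL (or dW1, db1, ...) into one giant column vector."""
    pieces = []
    for l in range(1, num_layers + 1):
        pieces.append(params[prefix_w + str(l)].reshape(-1, 1))  # matrix -> column vector
        pieces.append(params[prefix_b + str(l)].reshape(-1, 1))  # bias is already a vector
    return np.concatenate(pieces, axis=0)

# theta   = params_to_vector(parameters, L)            # from the W's and b's
# d_theta = params_to_vector(grads, L, "dW", "db")     # same ordering, so same dimension as theta
```

Because the gradients are concatenated in exactly the same order as the parameters, component i of d theta corresponds to component i of theta, which is what the check below relies on.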
And because we're taking a two-sided difference, we're going to do the same on the other side with theta i, but now minus epsilon. And then all of the other elements of theta are left alone. And then we take this difference and divide it by 2 epsilon. What we saw in the previous video is that this should be approximately equal to d theta i, which is supposed to be the partial derivative of J with respect to theta i, if d theta i is the derivative of the cost function J.

So what you're going to do is compute this for every value of i. And at the end, you end up with two vectors. You end up with this d theta approx, and this is going to be the same dimension as d theta. And both of these are in turn the same dimension as theta. And what you want to do is check whether these vectors are approximately equal to each other.

So, in detail, how do you define whether or not two vectors are reasonably close to each other? What I do is the following. I compute the distance between these two vectors, d theta approx minus d theta, so just the L2 norm of this. Notice there's no square on top, so this is the sum of squares of the elements of the difference, and then you take a square root, so you get the Euclidean distance. And then, just to normalize by the lengths of these vectors, you divide by the norm of d theta approx plus the norm of d theta, just the Euclidean lengths of these vectors. And the role of the denominator is, in case any of these vectors are really small or really large, the denominator turns this formula into a ratio.

When I implement this in practice, I use epsilon equal to maybe 10 to the minus 7. And with this range of epsilon, if you find that this formula gives you a value like 10 to the minus 7 or smaller, then that's great. It means that your derivative approximation is very likely correct. This is just a very small value. If it's maybe on the range of 10 to the minus 5, I would take a careful look. Maybe this is okay, but I might double-check the components of this vector, and make sure that none of the components are too large. And if some of the components of this difference are very large, then maybe you have a bug somewhere.
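Here is a minimal sketch of that loop and of the normalized difference, assuming you have a function J that evaluates the cost directly from the flattened vector theta; that helper and the name gradient_check are assumptions for illustration, not code from the lecture.

```python
import numpy as np

def gradient_check(J, theta, d_theta, epsilon=1e-7):
    """Check d_theta against two-sided numerical derivatives of J at theta."""
    d_theta_approx = np.zeros_like(d_theta)
    for i in range(theta.shape[0]):
        theta_plus = np.copy(theta)
        theta_plus[i] += epsilon            # nudge only component i upward
        theta_minus = np.copy(theta)
        theta_minus[i] -= epsilon           # nudge only component i downward
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)

    # Normalized Euclidean distance between the two gradient vectors.
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

# Roughly: ~1e-7 is great, ~1e-5 deserves a closer look, anything near 1e-3 or
# bigger likely means a bug in the backprop implementation.
```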
And if this formula gives you a value on the order of 10 to the minus 3, then I would be much more concerned that maybe there's a bug somewhere. You should really be getting values much smaller than 10 to the minus 3. If it's any bigger than 10 to the minus 3, then I would be quite concerned; I would be seriously worried that there might be a bug. And you should then look at the individual components of theta to see if there's a specific value of i for which d theta approx i is very different from d theta i, and use that to try to track down whether or not some of your derivative computations might be incorrect. And if after some amount of debugging it finally ends up being this kind of very small value, then you probably have a correct implementation.

So when implementing a neural network, what often happens is I'll implement forward prop, implement backprop, and then I might find that this grad check has a relatively big value. Then I will suspect that there must be a bug, go in, and debug, debug, debug. And after debugging for a while, if I find that it passes grad check with a small value, then you can be much more confident that it's correct.

So you now know how gradient checking works. This has helped me find lots of bugs in my implementations of neural nets, and I hope it'll help you too. In the next video, I want to share with you some tips, or some notes, on how to actually implement gradient checking. Let's go on to the next video.
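One way to do that component-by-component inspection, sketched here under the same assumptions as before (the helper name and the top_k parameter are made up for illustration), is to sort the entries by how much the two gradient vectors disagree:

```python
import numpy as np

def worst_components(d_theta_approx, d_theta, top_k=10):
    """Return the indices i where d_theta_approx[i] and d_theta[i] disagree the most."""
    gap = np.abs(d_theta_approx - d_theta).ravel()
    order = np.argsort(gap)[::-1][:top_k]     # largest discrepancies first
    return [(int(i), float(gap[i])) for i in order]

# Mapping these indices back through the ordering used when concatenating
# W1, b1, ..., WL, bL tells you which dW or db block to go inspect.
```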