1 00:00:00,930 --> 00:00:04,720 When you implement back propagation you'll find that there's a test called 2 00:00:04,720 --> 00:00:07,700 gradient checking that can really help you make sure 3 00:00:07,700 --> 00:00:10,710 that your implementation of back prop is correct. 4 00:00:10,710 --> 00:00:14,376 Because sometimes you write all these equations and you're just not 100% sure if 5 00:00:14,376 --> 00:00:17,940 you've got all the details right in back propagation. 6 00:00:17,940 --> 00:00:21,020 So in order to build up to gradient checking, 7 00:00:21,020 --> 00:00:25,490 let's first talk about how to numerically approximate computations of gradients and 8 00:00:25,490 --> 00:00:28,400 in the next video, we'll talk about how you can implement 9 00:00:28,400 --> 00:00:32,028 gradient checking to make sure the implementation of back prop is correct. 10 00:00:32,028 --> 00:00:37,310 So let's take the function f and replot it here and remember this is 11 00:00:37,310 --> 00:00:43,110 f of theta equals theta cubed, and let's again start off with some value of theta. 12 00:00:43,110 --> 00:00:44,640 Let's say theta equals 1. 13 00:00:44,640 --> 00:00:50,180 Now instead of just nudging theta to the right to get theta plus epsilon, 14 00:00:50,180 --> 00:00:52,460 we're going to nudge it to the right and 15 00:00:52,460 --> 00:00:58,110 nudge it to the left to get theta minus epsilon, as well as theta plus epsilon. 16 00:00:58,110 --> 00:01:02,935 So this is 1, this is 1.01, this is 0.99 where, again, 17 00:01:02,935 --> 00:01:06,144 epsilon is the same as before, it is 0.01.
18 00:01:06,144 --> 00:01:10,378 It turns out that rather than taking this little triangle and 19 00:01:10,378 --> 00:01:15,526 computing the height over the width, you can get a much better estimate of 20 00:01:15,526 --> 00:01:20,922 the gradient if you take this point, f of theta minus epsilon and this point, 21 00:01:20,922 --> 00:01:26,230 and you instead compute the height over width of this bigger triangle. 22 00:01:26,230 --> 00:01:31,988 So for technical reasons which I won't go into, the height over width of this bigger 23 00:01:31,988 --> 00:01:37,601 green triangle gives you a much better approximation to the derivative at theta. 24 00:01:37,601 --> 00:01:41,338 And rather than taking just this little triangle in the upper right, 25 00:01:41,338 --> 00:01:43,372 it's as if you have two triangles, right? 26 00:01:43,372 --> 00:01:47,220 This one on the upper right and this one on the lower left. 27 00:01:47,220 --> 00:01:49,760 And you're kind of taking both of them into account 28 00:01:49,760 --> 00:01:54,450 by using this bigger green triangle. 29 00:01:54,450 --> 00:01:57,720 So rather than a one-sided difference, you're taking a two-sided difference. 30 00:01:57,720 --> 00:01:58,954 So let's work out the math. 31 00:01:58,954 --> 00:02:03,648 This point here is f of theta plus epsilon. 32 00:02:03,648 --> 00:02:07,870 This point here is f of theta minus epsilon. 33 00:02:07,870 --> 00:02:12,390 So the height of this big green triangle is f of theta plus epsilon 34 00:02:12,390 --> 00:02:15,230 minus f of theta minus epsilon. 35 00:02:15,230 --> 00:02:21,250 And then the width, this is 1 epsilon, this is 2 epsilon. 36 00:02:21,250 --> 00:02:24,390 So the width of this green triangle is 2 epsilon. 37 00:02:24,390 --> 00:02:28,400 So the height over the width is going to be first the height, so 38 00:02:28,400 --> 00:02:35,110 that's f of theta plus epsilon minus f of theta minus epsilon divided by the width.
39 00:02:35,110 --> 00:02:37,920 So that was 2 epsilon, which we write down here. 40 00:02:38,950 --> 00:02:43,450 And this should hopefully be close to g of theta. 41 00:02:43,450 --> 00:02:46,350 So plug in the values, remember f of theta is theta cubed. 42 00:02:46,350 --> 00:02:49,890 So theta plus epsilon is 1.01. 43 00:02:49,890 --> 00:02:58,250 So I take the cube of that minus 0.99, the cube of that, divided by 2 times 0.01. 44 00:02:58,250 --> 00:03:03,580 Feel free to pause the video and practice on a calculator. 45 00:03:03,580 --> 00:03:06,259 You should get that this is 3.0001. 46 00:03:06,259 --> 00:03:10,581 Whereas from the previous slide, we saw that g of theta, 47 00:03:10,581 --> 00:03:14,272 this was 3 theta squared, so it was 3 when theta was 1, so 48 00:03:14,272 --> 00:03:18,519 these two values are actually very close to each other. 49 00:03:18,519 --> 00:03:22,250 The approximation error is now 0.0001. 50 00:03:22,250 --> 00:03:27,456 Whereas on the previous slide, when we took the one-sided 51 00:03:27,456 --> 00:03:34,150 difference, just theta and theta plus epsilon, we had gotten 3.0301 and 52 00:03:34,150 --> 00:03:40,340 so the approximation error was 0.03 rather than 0.0001. 53 00:03:40,340 --> 00:03:44,260 So with this two-sided difference way of 54 00:03:44,260 --> 00:03:48,462 approximating the derivative, you find that this is extremely close to 3. 55 00:03:48,462 --> 00:03:53,320 And so this gives you a much greater confidence that g of theta is 56 00:03:53,320 --> 00:03:56,890 probably a correct implementation of the derivative of f. 57 00:03:58,220 --> 00:04:01,480 When you use this method for gradient checking in back propagation, 58 00:04:01,480 --> 00:04:06,230 this turns out to run twice as slow as if you were to use a one-sided difference. 59 00:04:06,230 --> 00:04:10,193 It turns out that in practice I think it's worth it to use this other method because 60 00:04:10,193 --> 00:04:11,752 it's just much more accurate.
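For anyone who would rather run the arithmetic than use a calculator, here is a minimal Python sketch of the worked example above, with f of theta equals theta cubed, theta equals 1 and epsilon equals 0.01. The helper names `two_sided` and `one_sided` are made up for illustration and do not come from the lecture.

```python
def f(theta):
    # The example function from the lecture: f(theta) = theta cubed.
    return theta ** 3

def two_sided(f, theta, eps):
    # Height over width of the bigger triangle: (f(theta+eps) - f(theta-eps)) / (2*eps).
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def one_sided(f, theta, eps):
    # Height over width of the small triangle: (f(theta+eps) - f(theta)) / eps.
    return (f(theta + eps) - f(theta)) / eps

theta, eps = 1.0, 0.01
exact = 3 * theta ** 2  # g(theta) = 3 * theta squared, which is 3 at theta = 1

approx_two = two_sided(f, theta, eps)
approx_one = one_sided(f, theta, eps)

print(f"two-sided: {approx_two:.4f}, error {abs(approx_two - exact):.4f}")
print(f"one-sided: {approx_one:.4f}, error {abs(approx_one - exact):.4f}")
# two-sided: 3.0001, error 0.0001
# one-sided: 3.0301, error 0.0301
```

The one-sided error prints as 0.0301, which the lecture rounds to 0.03. The two-sided version calls `f` twice per estimate, which is where the extra cost comes from.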
61 00:04:11,752 --> 00:04:13,946 A little bit of optional theory for 62 00:04:13,946 --> 00:04:18,685 those of you that are a little bit more familiar with calculus, it turns out that, 63 00:04:18,685 --> 00:04:22,249 and it's okay if you don't get what I'm about to say here. 64 00:04:22,249 --> 00:04:26,772 But it turns out that the formal definition of a derivative is, for 65 00:04:26,772 --> 00:04:31,629 very small values of epsilon, f of theta plus epsilon minus f of theta 66 00:04:31,629 --> 00:04:33,917 minus epsilon over 2 epsilon. 67 00:04:33,917 --> 00:04:38,852 And the formal definition of the derivative is the limit of exactly 68 00:04:38,852 --> 00:04:42,480 that formula on the right as epsilon goes to 0. 69 00:04:42,480 --> 00:04:46,270 And the definition of a limit is something that you learned if you 70 00:04:46,270 --> 00:04:48,980 took a calculus class but I won't go into that here. 71 00:04:48,980 --> 00:04:52,398 And it turns out that for a nonzero value of epsilon, 72 00:04:52,398 --> 00:04:56,517 you can show that the error of this approximation is on the order 73 00:04:56,517 --> 00:05:00,889 of epsilon squared, and remember epsilon is a very small number. 74 00:05:00,889 --> 00:05:08,471 So if epsilon is 0.01, which it is here, then epsilon squared is 0.0001. 75 00:05:08,471 --> 00:05:12,098 The big O notation means the error is actually some constant times this, but 76 00:05:12,098 --> 00:05:15,240 this is actually exactly our approximation error. 77 00:05:15,240 --> 00:05:17,478 So the big O constant happens to be 1. 78 00:05:17,478 --> 00:05:22,182 Whereas in contrast if we were to use this formula, the other one, 79 00:05:22,182 --> 00:05:25,129 then the error is on the order of epsilon.
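For those who want to see where the two error orders come from, here is a sketch using a Taylor expansion. It assumes f is smooth enough near theta, which the lecture does not state but which holds for theta cubed.

```latex
% Taylor expansion of f around theta (assumes f has enough continuous derivatives)
f(\theta \pm \varepsilon) = f(\theta) \pm \varepsilon f'(\theta)
  + \frac{\varepsilon^2}{2} f''(\theta)
  \pm \frac{\varepsilon^3}{6} f'''(\theta) + O(\varepsilon^4)

% Two-sided difference: the even-order terms cancel
\frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}
  = f'(\theta) + \frac{\varepsilon^2}{6} f'''(\theta) + O(\varepsilon^4)

% One-sided difference: the second-derivative term survives
\frac{f(\theta + \varepsilon) - f(\theta)}{\varepsilon}
  = f'(\theta) + \frac{\varepsilon}{2} f''(\theta) + O(\varepsilon^2)

% For f(\theta) = \theta^3: f''(\theta) = 6\theta, f'''(\theta) = 6, higher derivatives vanish
\text{two-sided error} = \varepsilon^2 = 0.0001, \qquad
\text{one-sided error} = 3\theta\varepsilon + \varepsilon^2 = 0.0301
  \quad (\theta = 1,\ \varepsilon = 0.01)
```

For theta cubed the third derivative is 6 and all higher derivatives are zero, so the two-sided error is exactly epsilon squared. That is why the constant happens to be 1 in this example.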
80 00:05:25,129 --> 00:05:29,872 And again, when epsilon is a number less than 1, then epsilon is actually 81 00:05:29,872 --> 00:05:34,618 much bigger than epsilon squared which is why this formula here is actually 82 00:05:34,618 --> 00:05:38,790 a much less accurate approximation than this formula on the left. 83 00:05:38,790 --> 00:05:43,690 Which is why when doing gradient checking, we'd rather use this two-sided difference 84 00:05:43,690 --> 00:05:48,113 where you compute f of theta plus epsilon minus f of theta minus epsilon and then 85 00:05:48,113 --> 00:05:52,900 divide by 2 epsilon rather than just a one-sided difference which is less accurate. 86 00:05:53,980 --> 00:05:57,090 If you didn't understand my last two comments, all of these things are on here. 87 00:05:57,090 --> 00:05:58,480 Don't worry about it. 88 00:05:58,480 --> 00:06:02,460 That's really more for those of you that are a bit more familiar with calculus, and 89 00:06:02,460 --> 00:06:04,630 with numerical approximations. 90 00:06:04,630 --> 00:06:08,890 But the takeaway is that this two-sided difference formula is much more accurate. 91 00:06:08,890 --> 00:06:12,445 And so that's what we're going to use when we do gradient checking in the next video. 92 00:06:13,725 --> 00:06:16,355 So you've seen how by taking a two-sided difference, 93 00:06:16,355 --> 00:06:20,845 you can numerically verify whether or not a function g, g of theta, that someone 94 00:06:20,845 --> 00:06:25,675 else gives you is a correct implementation of the derivative of a function f. 95 00:06:25,675 --> 00:06:28,265 Let's now see how we can use this to verify whether or 96 00:06:28,265 --> 00:06:31,435 not your back propagation implementation is correct or 97 00:06:31,435 --> 00:06:34,855 if there might be a bug in there that you need to go and tease out.
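The idea of verifying a function g that someone else gives you can be sketched in Python for the single-variable case. The helper `check_derivative`, the tolerance `tol` and the deliberately wrong `g_buggy` are all made up for illustration. The lecture gives no threshold here, so the value 1e-3 is only an assumption sized for epsilon equals 0.01.

```python
def check_derivative(f, g, theta, eps=1e-2, tol=1e-3):
    # Two-sided difference estimate of f'(theta).
    approx = (f(theta + eps) - f(theta - eps)) / (2 * eps)
    # tol is an assumed threshold, not a value from the lecture.
    return abs(approx - g(theta)) <= tol, approx

def f(theta):
    return theta ** 3

def g_correct(theta):
    # The true derivative of theta cubed.
    return 3 * theta ** 2

def g_buggy(theta):
    # A deliberately wrong derivative, standing in for a back prop bug.
    return 2 * theta ** 2

ok, approx = check_derivative(f, g_correct, 1.0)
print(ok, round(approx, 4))  # True 3.0001

ok, approx = check_derivative(f, g_buggy, 1.0)
print(ok, round(approx, 4))  # False 3.0001
```

The tolerance has to sit above the expected epsilon squared error of the estimate and well below the size of error a real bug would cause.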