Gradient checking is a technique that's helped me save tons of time, and helped me find bugs in my implementations of back propagation many times. Let's see how you can use it too to debug, or to verify that your implementation of backprop is correct.

So your neural network will have some set of parameters, W[1], b[1], and so on up to W[L], b[L]. To implement gradient checking, the first thing you should do is take all your parameters and reshape them into a giant vector theta. So what you should do is take W, which is a matrix, and reshape it into a vector. You take all of these Ws and reshape them into vectors, and then concatenate all of these things, so that you have a giant vector called theta. So instead of the cost function J being a function of the Ws and bs, you now have the cost function J being just a function of theta.

Next, with W and b ordered the same way, you can also take dW[1], db[1], and so on, and concatenate them into a big, giant vector d theta of the same dimension as theta. So same as before, you reshape dW[1], which is a matrix; db[1] is already a vector. You reshape dW[L], all of the dWs, which are matrices. Remember, dW[1] has the same dimension as W[1], and db[1] has the same dimension as b[1]. So with the same sort of reshaping and concatenation operation, you can reshape all of these derivatives into a giant vector d theta, which has the same dimension as theta. So the question is now: is d theta the gradient, or the slope, of the cost function J?

So here's how you implement gradient checking, often abbreviated to grad check. First, remember that J is now a function of the giant parameter vector theta. So it expands to J being a function of theta 1, theta 2, theta 3, and so on, up to whatever the dimension of this giant parameter vector theta is. To implement grad check, what you're going to do is implement a loop, so that for each i, that is, for each component of theta, you compute d theta approx i using a two-sided difference. So you take J of theta 1, theta 2, up to theta i, and you nudge theta i by adding epsilon to it. So you just increase theta i by epsilon, and keep everything else the same.
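As a concrete illustration of this reshaping step, here is a minimal NumPy sketch. It assumes the parameters (and, with the same ordering, the gradients) live in a Python dictionary keyed by names like "W1", "b1" or "dW1", "db1"; that layout and the function name are assumptions for the example, not something specified in the lecture.

```python
import numpy as np

def params_to_vector(params, num_layers, prefix_w="W", prefix_b="b"):
    """Reshape W1, b1, ..., WL, bL (or dW1, db1, ...) into one giant column vector."""
    pieces = []
    for l in range(1, num_layers + 1):
        pieces.append(params[prefix_w + str(l)].reshape(-1, 1))  # matrix -> column vector
        pieces.append(params[prefix_b + str(l)].reshape(-1, 1))  # bias is already a vector
    return np.concatenate(pieces, axis=0)

# theta   = params_to_vector(parameters, L)            # from the W's and b's
# d_theta = params_to_vector(grads, L, "dW", "db")     # same ordering, so same dimension as theta
```

Because the gradients are concatenated in exactly the same order as the parameters, component i of d theta corresponds to component i of theta, which is what the check below relies on.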
And because we're taking a two-sided difference, we're going to do the same on the other side with theta i, but now minus epsilon. And then all of the other elements of theta are left alone. And then we take this difference and divide it by 2 epsilon. What we saw in the previous video is that this should be approximately equal to d theta i, which is supposed to be the partial derivative of J with respect to theta i, if d theta i is the derivative of the cost function J.

So what you're going to do is compute this for every value of i. And at the end, you end up with two vectors. You end up with this d theta approx, and this is going to be the same dimension as d theta. And both of these are in turn the same dimension as theta. And what you want to do is check whether these vectors are approximately equal to each other.

So, in detail, how do you define whether or not two vectors are reasonably close to each other? What I do is the following. I compute the distance between these two vectors, d theta approx minus d theta, so just the L2 norm of this. Notice there's no square on top, so this is the sum of squares of the elements of the difference, and then you take a square root, so you get the Euclidean distance. And then, just to normalize by the lengths of these vectors, you divide by the norm of d theta approx plus the norm of d theta, just the Euclidean lengths of these vectors. And the role of the denominator is, in case any of these vectors are really small or really large, the denominator turns this formula into a ratio.

When I implement this in practice, I use epsilon equal to maybe 10 to the minus 7. And with this range of epsilon, if you find that this formula gives you a value like 10 to the minus 7 or smaller, then that's great. It means that your derivative approximation is very likely correct. This is just a very small value. If it's maybe on the range of 10 to the minus 5, I would take a careful look. Maybe this is okay, but I might double-check the components of this vector, and make sure that none of the components are too large. And if some of the components of this difference are very large, then maybe you have a bug somewhere.
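Here is a minimal sketch of that loop and of the normalized difference, assuming you have a function J that evaluates the cost directly from the flattened vector theta; that helper and the name gradient_check are assumptions for illustration, not code from the lecture.

```python
import numpy as np

def gradient_check(J, theta, d_theta, epsilon=1e-7):
    """Check d_theta against two-sided numerical derivatives of J at theta."""
    d_theta_approx = np.zeros_like(d_theta)
    for i in range(theta.shape[0]):
        theta_plus = np.copy(theta)
        theta_plus[i] += epsilon            # nudge only component i upward
        theta_minus = np.copy(theta)
        theta_minus[i] -= epsilon           # nudge only component i downward
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)

    # Normalized Euclidean distance between the two gradient vectors.
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

# Roughly: ~1e-7 is great, ~1e-5 deserves a closer look, anything near 1e-3 or
# bigger likely means a bug in the backprop implementation.
```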
And if this formula gives you a value on the order of 10 to the minus 3, then I would be much more concerned that maybe there's a bug somewhere. You should really be getting values much smaller than 10 to the minus 3. If it's any bigger than 10 to the minus 3, then I would be quite concerned; I would be seriously worried that there might be a bug. And you should then look at the individual components of theta to see if there's a specific value of i for which d theta approx i is very different from d theta i, and use that to try to track down whether or not some of your derivative computations might be incorrect. And if after some amount of debugging it finally ends up being this kind of very small value, then you probably have a correct implementation.

So when implementing a neural network, what often happens is I'll implement forward prop, implement backprop, and then I might find that this grad check has a relatively big value. Then I will suspect that there must be a bug, go in, and debug, debug, debug. And after debugging for a while, if I find that it passes grad check with a small value, then you can be much more confident that it's correct.

So you now know how gradient checking works. This has helped me find lots of bugs in my implementations of neural nets, and I hope it'll help you too. In the next video, I want to share with you some tips, or some notes, on how to actually implement gradient checking. Let's go on to the next video.
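One way to do that component-by-component inspection, sketched here under the same assumptions as before (the helper name and the top_k parameter are made up for illustration), is to sort the entries by how much the two gradient vectors disagree:

```python
import numpy as np

def worst_components(d_theta_approx, d_theta, top_k=10):
    """Return the indices i where d_theta_approx[i] and d_theta[i] disagree the most."""
    gap = np.abs(d_theta_approx - d_theta).ravel()
    order = np.argsort(gap)[::-1][:top_k]     # largest discrepancies first
    return [(int(i), float(gap[i])) for i in order]

# Mapping these indices back through the ordering used when concatenating
# W1, b1, ..., WL, bL tells you which dW or db block to go inspect.
```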