In this video, I'm going to give a brief overview of the Hessian-Free optimizer that can be used to train recurrent neural networks very effectively. This is a very complicated optimizer and I don't expect you to get all the details of it from this video. I just want you to have a general feel for how it works, and then in the next video we will see how well it does on an interesting problem.

When we're training the weights of a neural network, we are trying to get as far down the error surface as possible. So one question is: if we choose a given direction to go in, how much reduction in the error can we achieve by going just the right distance in that direction? How much does the error decrease before it starts rising again? Here we'll assume that the curvature is constant, that is, that it really is a quadratic error surface. We'll also assume that the magnitude of the gradient decreases as we move down the gradient, which amounts to assuming that the error surface is concave upward, like a bowl.

The maximum reduction we can get in the error by going in a particular direction depends on the ratio of the gradient to the curvature. So we want to move in directions that have a good ratio: even if the gradient is quite small, we want the curvature to be even smaller.

So here's an example of a direction we could move in, where the vertical axis corresponds to the error, the horizontal axis corresponds to the weights in the direction we're moving in, and the blue arrow corresponds to the reduction we get if we start at that red point. Here's a surface that has a gentler gradient, but because it has a better ratio of gradient to curvature, we get a bigger reduction in the error by the time we get to the minimum. The question is, how can we find directions like that second one: directions in which, even though the gradient may be small, the curvature is even smaller?

So let's start with Newton's method. Newton's method addresses the basic problem with steepest descent, which is that the gradient isn't the direction you want to go in. If the error surface has circular cross-sections and is quadratic, the gradient is a good direction to go: it will point straight at the minimum. So the idea of Newton's method is to apply a linear transformation that turns ellipses into circles.
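To pin down that claim about the ratio of the gradient to the curvature, here is the little calculation behind it. This is my own rendering, not something from the lecture; g and c are just my symbols for the gradient and the curvature along the chosen direction.

```latex
% Along one direction, model the error as a quadratic in the step size \Delta:
E(w_0 + \Delta) \;\approx\; E(w_0) + g\,\Delta + \tfrac{1}{2}\,c\,\Delta^{2}, \qquad c > 0 .

% Setting the derivative to zero gives the best step and the biggest possible reduction:
\Delta^{*} = -\frac{g}{c}, \qquad E(w_0) - E(w_0 + \Delta^{*}) = \frac{g^{2}}{2c} .
```

The achievable reduction g^2 / (2c) is large exactly when the gradient is big relative to the curvature, which is why a direction with a small gradient can still be an excellent direction if its curvature is smaller still.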
If we apply that linear transformation to the gradient vector, it will be as if we were going downhill on a circular error surface. To do this, we need to multiply the gradient dE/dw by the inverse of the curvature matrix. So H is the curvature matrix, sometimes called the Hessian. It's a function of the weights we have, and we need to take its inverse and multiply the gradient by that. Then we need to go some distance in that direction. If it's a truly quadratic surface and we choose epsilon correctly, which is quite easy to do, we'll arrive at the minimum of the surface in a single step. Of course, that single step involves something complicated, which is inverting the Hessian matrix. The problem is that even if we only have a million weights in our neural network, the curvature matrix, the Hessian, will have a trillion terms, and it's completely infeasible to invert it.

So curvature matrices look like this: for each pair of weights, w_i and w_j, they tell you how the gradient in one direction changes as you move in another direction. In other words, as I change weight i, how does the gradient of the error with respect to weight j change? That's what a typical off-diagonal term tells you. The terms on the diagonal tell you how the gradient of the error in the direction of a weight changes as you change that weight.

So the off-diagonal terms in a curvature matrix correspond to twists in the error surface. A twist means that when you travel in one direction, the gradient in another direction changes. If we have a nice circular bowl, all those off-diagonal terms are zero: as we travel in one direction, the gradient in other directions doesn't change.

So what's going wrong with steepest descent, when you have an elliptical error surface, is that as we travel in one direction, the gradient in another direction changes. If I update one of the weights at the same time as I'm updating all the other weights, all those other updates will cause a change in the gradient for the first weight. And that means that when I update it, I may actually make things worse: the gradient may have actually reversed sign due to all the changes in the other weights.
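Just to make that single Newton step concrete, here's a tiny sketch of my own, on a two-dimensional quadratic where we can afford to form and invert H explicitly; that is exactly what stops being possible with a million weights.

```python
import numpy as np

# A 2-D quadratic error E(w) = 0.5 * w^T H w - b^T w.
# The off-diagonal entries of H are the "twists": changing w1 changes
# the gradient with respect to w2, and vice versa.
H = np.array([[4.0, 1.5],
              [1.5, 1.0]])           # curvature (Hessian) matrix
b = np.array([1.0, -2.0])

def error(w):
    return 0.5 * w @ H @ w - b @ w

def gradient(w):                      # dE/dw for this quadratic
    return H @ w - b

w = np.array([3.0, 3.0])              # starting point

# Newton step: multiply the gradient by the inverse of the curvature matrix.
w_newton = w - np.linalg.inv(H) @ gradient(w)
print(error(w_newton))                # the exact minimum, reached in one step

# A plain steepest-descent step from the same point does much worse, because
# it ignores the interactions encoded in the off-diagonal terms.
w_sd = w - 0.1 * gradient(w)
print(error(w_sd))

# With a million weights, H would have about 10^12 entries, so forming
# and inverting it like this is completely infeasible.
```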
And so, as we get more and more weights, we need to be more and more cautious about changing each one of them, because the simultaneous changes in all the other weights can change the gradient for that weight. The curvature matrix determines the size of those interactions.

So we have to deal with the curvature; we can't just ignore it. And we'd like to deal with it without actually inverting a huge matrix, because the matrix has too many terms in a big neural net.

One thing we can do is to just look at the leading diagonal of the curvature matrix and make our step sizes depend on that leading diagonal. That helps a bit: it lets us use different step sizes for different weights. But the diagonal terms capture only a tiny fraction of the interactions, so we're ignoring most of the terms in the curvature matrix when we do that. In fact, we're ignoring nearly all of them.

Another thing we could do is to approximate the curvature matrix with a matrix of much lower rank that captures its main aspects. That is what is done in Hessian-Free methods, L-BFGS, and many other methods that try to do an approximate second-order minimization of the error.

In the Hessian-Free method, we make an approximation to the curvature matrix and then we assume that the approximation is correct. So we assume we know what the curvature is and that the error surface really is quadratic. Then, starting from wherever we are now, we minimize the error using an efficient technique called conjugate gradient. Once we've done that, once we've got close to a minimum of this approximation to the curvature, we make another approximation to the curvature matrix and use conjugate gradient to minimize again.

It's also important in recurrent neural networks to add a penalty for changing any of the hidden activities too much. That will prevent us, for example, from changing a weight early on that causes huge effects later in the sequence. We don't want effects that are too big, and if we look at the changes in the hidden activities we can prevent that by penalizing those changes. If we put a quadratic penalty on those changes, we can combine it with the rest of the Hessian-Free method.

The last thing I need to explain is conjugate gradient, and I'm just going to explain it briefly.
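Before getting to conjugate gradient, it helps to write down the objective it is applied to at each step. This is my own formalization of what was just said: G stands for whatever approximation to the curvature matrix is being used at the current weights w, and mu and Delta h_t are my notation for the penalty strength and the change in the hidden activity at time t; the penalty term is what the HF literature calls structural damping.

```latex
% Local quadratic model of the error that is handed to conjugate gradient:
M(\Delta w) \;=\; E(w) \;+\; \nabla E(w)^{\top} \Delta w
            \;+\; \tfrac{1}{2}\,\Delta w^{\top} G\,\Delta w .

% For a recurrent net, add a quadratic penalty on the changes in the
% hidden activities that the weight change induces:
M_{\text{damped}}(\Delta w) \;=\; M(\Delta w)
            \;+\; \tfrac{\mu}{2} \sum_{t} \lVert \Delta h_t \rVert^{2} .
```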
Conjugate gradient is a very clever method that, instead of trying to go straight to the minimum as in Newton's method, tries to minimize in one direction at a time. So it starts off by taking the direction of steepest descent and going to the minimum in that direction. That might involve re-evaluating the gradient and re-evaluating the error a few times to find the minimum in that direction.

Once it's done that, it finds another direction and goes to the minimum in that second direction. The clever thing about the technique is that it chooses the second direction in such a way that it doesn't mess up the minimization it already did in the first direction. That's called a conjugate direction. Conjugate means that as you go in the new direction, you don't change the gradients in the previous directions. It's a funny idea. It's like the idea of a twist in an error surface: a twist means that when you go in one direction, you change the gradient in another direction. A conjugate direction is one you can go in that, in a sense, doesn't have a twist: you go in that direction and the gradient in the first direction doesn't change.

So here is a picture of an ellipse, and the red line is the major axis of the ellipse. We start off by doing one step of steepest descent, all the way to the minimum in that direction. If you think about it a bit, you can see that that minimum won't actually lie on the red line. On the red line, the gradient is zero at right angles to the red line, because it's the bottom of the ravine. But the direction we're going in isn't actually at right angles to the red line at that point. We could make a little bit more progress by taking a small step at right angles to the red line and then a small step along the red line. Since the red line slopes down towards the middle of the ellipse, that's going to make some progress for us. So when we minimize in the first direction, we'll end up slightly across the bottom of the ellipse.

When we reach that point that's a minimum, there's an interesting property of all the points that lie on the green line: on that green line, the gradient in the direction of the black arrow is zero. So we can go anywhere along that green line and we won't destroy the fact that we are at a minimum in the direction of the black arrow.
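Here is what that direction-by-direction procedure looks like in its standard textbook form for a quadratic. This is a sketch only: the conjugate gradient used inside HF adds damping, preconditioning and careful stopping rules that aren't shown, and apply_H is a placeholder for whatever supplies curvature-matrix-times-vector products.

```python
import numpy as np

def conjugate_gradient(apply_H, g, steps):
    """Approximately minimize the quadratic 0.5 * d^T H d + g^T d.

    apply_H(v) must return H @ v for a symmetric positive-definite H;
    the matrix itself is never formed explicitly.
    """
    d = np.zeros_like(g)   # current point
    r = -g.copy()          # residual = negative gradient of the quadratic at d
    p = r.copy()           # first direction: steepest descent
    for _ in range(steps):
        Hp = apply_H(p)
        alpha = (r @ r) / (p @ Hp)   # exact minimizer along direction p
        d = d + alpha * p            # go all the way to the minimum in that direction
        r_new = r - alpha * Hp
        # The next direction is chosen to be conjugate to the previous ones:
        # moving along it leaves the gradient in those directions unchanged.
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d
```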
If we can keep doing that for many directions in a high-dimensional error surface, we'll eventually be at a minimum in many different directions. And if we are at a minimum in as many different directions as there are dimensions in the space, we'll be at the global minimum.

So we take this first step of steepest descent. We then figure out, and I'm not going to explain how we do that, the direction of that green line, and then we do a search along the green line to find how far we should go in order to minimize the error along it. And we take our second step, like this. Now, in this two-dimensional space, we'll be at the minimum: we're at a minimum in the direction of the first step, and we're now at a minimum in the direction of the second step while still being at a minimum in the direction of the first step, so that must be the global minimum.

What conjugate gradient achieves is that it gets to the global minimum of an N-dimensional quadratic surface in only N steps. It's very efficient. It does that because it manages to get the gradient to be zero in N different directions. They're not orthogonal directions, but they are independent of one another, and that is sufficient to be at the global minimum. More importantly, in many fewer than N steps on a typical quadratic surface, it will have reduced the error to very close to its minimum value, and that's why we use it. We're not going to do the full N steps; that would be as expensive as inverting the whole matrix. We're going to do many fewer than N steps and get quite close to the minimum.

You can apply conjugate gradient directly to a non-quadratic error surface, like the error surface for a multilayer non-linear neural net, and it usually works quite well. It's essentially a batch method, but you can apply it to large mini-batches: you do many steps of conjugate gradient on the same large mini-batch and then you move on to the next large mini-batch. That's called non-linear conjugate gradient.

The Hessian-Free optimizer uses conjugate gradient for minimization on a genuinely quadratic surface, and that's what conjugate gradient is best at. It works much better for that than for a non-linear surface.
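To check the "N steps for an N-dimensional quadratic" claim numerically, here is the sketch above run on a two-dimensional elliptical quadratic. The matrix and vector are made-up numbers, and this continues the hypothetical conjugate_gradient function from the previous block (with numpy already imported).

```python
# Two conjugate-gradient steps reach the exact minimum of a 2-D quadratic,
# matching the "N steps for an N-dimensional surface" claim.
H = np.array([[5.0, 2.0],
              [2.0, 1.0]])
g = np.array([1.0, -1.0])

d = conjugate_gradient(lambda v: H @ v, g, steps=2)
print(np.allclose(H @ d, -g))   # True: the gradient of the quadratic is zero here
```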
The genuinely quadratic surface that HF is using it for is the quadratic approximation to the true error surface made by the Hessian-Free method. So it makes that approximation, it uses conjugate gradient to get close to a minimum of that first approximation, and then it makes a new approximation to the curvature and does it again.
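Put together, that outer loop might look roughly like the sketch below. It's only a sketch: gradient and curvature_vector_product are placeholder functions that the model would have to provide (real implementations get the curvature-vector products without ever forming the matrix), it reuses the conjugate_gradient routine above, and the damping, the re-use of the previous CG solution, and the mini-batching of a real HF implementation are all left out.

```python
def hessian_free(w, gradient, curvature_vector_product,
                 outer_iters=100, cg_steps=50):
    """Rough sketch of the HF outer loop: make a quadratic approximation,
    run a truncated conjugate-gradient minimization of it, update the
    weights, then re-approximate the curvature and repeat."""
    for _ in range(outer_iters):
        g = gradient(w)
        # Curvature of the fresh quadratic approximation around w, exposed
        # only through matrix-vector products G @ v.
        apply_G = lambda v: curvature_vector_product(w, v)
        delta = conjugate_gradient(apply_G, g, steps=cg_steps)
        w = w + delta
    return w
```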