In this video, I'm going to give a brief overview of the Hessian-Free optimizer that can be used to train recurrent neural networks very effectively. This is a very complicated optimizer and I don't expect you to get all the details of it from this video. I just want you to have a general feel for how it works, and then in the next video we will see how well it does on an interesting problem.

When we're training the weights of a neural network, we are trying to get as far down the error surface as possible. So one question is: if we choose a given direction to go in, how much reduction in the error can we achieve by going just the right distance in that direction? How much does the error decrease before it starts rising again? Here we'll assume that the curvature is constant, that is, that it really is a quadratic error surface. We'll also assume that the magnitude of the gradient decreases as we move down the gradient, which amounts to assuming that the error surface is concave upward, like a bowl.

The maximum reduction we can get in the error by going in a particular direction depends on the ratio of the gradient to the curvature. So we want to move in directions that have a good ratio: even if the gradient is quite small, we want the curvature to be even smaller.

So here's an example of a direction we could move in, where the vertical axis corresponds to the error, the horizontal axis corresponds to the weights in the direction we're moving in, and the blue arrow corresponds to the reduction we get if we start at that red point. Here's a surface that has a gentler gradient, but because it has a better ratio of gradient to curvature, we get a bigger reduction in the error by the time we get to the minimum. The question is, how can we find directions like that second one: directions in which, even though the gradient may be small, the curvature is even smaller?

So let's start with Newton's method. Newton's method addresses the basic problem with steepest descent, which is that the gradient isn't the direction you want to go in. If the error surface has circular cross-sections and is quadratic, the gradient is a good direction to go: it will point straight at the minimum. So the idea of Newton's method is to apply a linear transformation that turns ellipses into circles.
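To pin down that claim about the ratio of the gradient to the curvature, here is the little calculation behind it. This is my own rendering, not something from the lecture; g and c are just my symbols for the gradient and the curvature along the chosen direction.

```latex
% Along one direction, model the error as a quadratic in the step size \Delta:
E(w_0 + \Delta) \;\approx\; E(w_0) + g\,\Delta + \tfrac{1}{2}\,c\,\Delta^{2}, \qquad c > 0 .

% Setting the derivative to zero gives the best step and the biggest possible reduction:
\Delta^{*} = -\frac{g}{c}, \qquad E(w_0) - E(w_0 + \Delta^{*}) = \frac{g^{2}}{2c} .
```

The achievable reduction g^2 / (2c) is large exactly when the gradient is big relative to the curvature, which is why a direction with a small gradient can still be an excellent direction if its curvature is smaller still.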
If we apply that linear transformation to the gradient vector, it will be as if we were going downhill on a circular error surface. To do this, we need to multiply the gradient dE/dw by the inverse of the curvature matrix. So H is the curvature matrix, sometimes called the Hessian. It's a function of the weights we have, and we need to take its inverse and multiply the gradient by that. Then we need to go some distance in that direction. If it's a truly quadratic surface and we choose epsilon correctly, which is quite easy to do, we'll arrive at the minimum of the surface in a single step. Of course, that single step involves something complicated, which is inverting the Hessian matrix. The problem is that even if we only have a million weights in our neural network, the curvature matrix, the Hessian, will have a trillion terms, and it's completely infeasible to invert it.

So curvature matrices look like this: for each pair of weights, w_i and w_j, they tell you how the gradient in one direction changes as you move in another direction. In other words, as I change weight i, how does the gradient of the error with respect to weight j change? That's what a typical off-diagonal term tells you. The terms on the diagonal tell you how the gradient of the error in the direction of a weight changes as you change that weight.

So the off-diagonal terms in a curvature matrix correspond to twists in the error surface. A twist means that when you travel in one direction, the gradient in another direction changes. If we have a nice circular bowl, all those off-diagonal terms are zero: as we travel in one direction, the gradient in other directions doesn't change.

So what's going wrong with steepest descent, when you have an elliptical error surface, is that as we travel in one direction, the gradient in another direction changes. If I update one of the weights at the same time as I'm updating all the other weights, all those other updates will cause a change in the gradient for the first weight. And that means that when I update it, I may actually make things worse: the gradient may have actually reversed sign due to all the changes in the other weights.
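Just to make that single Newton step concrete, here's a tiny sketch of my own, on a two-dimensional quadratic where we can afford to form and invert H explicitly; that is exactly what stops being possible with a million weights.

```python
import numpy as np

# A 2-D quadratic error E(w) = 0.5 * w^T H w - b^T w.
# The off-diagonal entries of H are the "twists": changing w1 changes
# the gradient with respect to w2, and vice versa.
H = np.array([[4.0, 1.5],
              [1.5, 1.0]])           # curvature (Hessian) matrix
b = np.array([1.0, -2.0])

def error(w):
    return 0.5 * w @ H @ w - b @ w

def gradient(w):                      # dE/dw for this quadratic
    return H @ w - b

w = np.array([3.0, 3.0])              # starting point

# Newton step: multiply the gradient by the inverse of the curvature matrix.
w_newton = w - np.linalg.inv(H) @ gradient(w)
print(error(w_newton))                # the exact minimum, reached in one step

# A plain steepest-descent step from the same point does much worse, because
# it ignores the interactions encoded in the off-diagonal terms.
w_sd = w - 0.1 * gradient(w)
print(error(w_sd))

# With a million weights, H would have about 10^12 entries, so forming
# and inverting it like this is completely infeasible.
```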
And so, as we get more and more weights, we need to be more and more cautious about changing each one of them, because the simultaneous changes in all the other weights can change the gradient for that weight. The curvature matrix determines the size of those interactions.

So we have to deal with the curvature; we can't just ignore it. And we'd like to deal with it without actually inverting a huge matrix, because the matrix has too many terms in a big neural net.

One thing we can do is to just look at the leading diagonal of the curvature matrix and make our step sizes depend on that leading diagonal. That helps a bit: it lets us use different step sizes for different weights. But the diagonal terms capture only a tiny fraction of the interactions, so we're ignoring most of the terms in the curvature matrix when we do that. In fact, we're ignoring nearly all of them.

Another thing we could do is to approximate the curvature matrix with a matrix of much lower rank that captures its main aspects. That is what is done in Hessian-Free methods, L-BFGS, and many other methods that try to do an approximate second-order minimization of the error.

In the Hessian-Free method, we make an approximation to the curvature matrix and then we assume that the approximation is correct. So we assume we know what the curvature is and that the error surface really is quadratic. Then, starting from wherever we are now, we minimize the error using an efficient technique called conjugate gradient. Once we've done that, once we've got close to a minimum of this approximation to the curvature, we make another approximation to the curvature matrix and use conjugate gradient to minimize again.

It's also important in recurrent neural networks to add a penalty for changing any of the hidden activities too much. That will prevent us, for example, from changing a weight early on that causes huge effects later in the sequence. We don't want effects that are too big, and if we look at the changes in the hidden activities we can prevent that by penalizing those changes. If we put a quadratic penalty on those changes, we can combine it with the rest of the Hessian-Free method.

The last thing I need to explain is conjugate gradient, and I'm just going to explain it briefly.
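Before getting to conjugate gradient, it helps to write down the objective it is applied to at each step. This is my own formalization of what was just said: G stands for whatever approximation to the curvature matrix is being used at the current weights w, and mu and Delta h_t are my notation for the penalty strength and the change in the hidden activity at time t; the penalty term is what the HF literature calls structural damping.

```latex
% Local quadratic model of the error that is handed to conjugate gradient:
M(\Delta w) \;=\; E(w) \;+\; \nabla E(w)^{\top} \Delta w
            \;+\; \tfrac{1}{2}\,\Delta w^{\top} G\,\Delta w .

% For a recurrent net, add a quadratic penalty on the changes in the
% hidden activities that the weight change induces:
M_{\text{damped}}(\Delta w) \;=\; M(\Delta w)
            \;+\; \tfrac{\mu}{2} \sum_{t} \lVert \Delta h_t \rVert^{2} .
```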
Conjugate gradient is a very clever method that, instead of trying to go straight to the minimum as in Newton's method, tries to minimize in one direction at a time. So it starts off by taking the direction of steepest descent and going to the minimum in that direction. That might involve re-evaluating the gradient and re-evaluating the error a few times to find the minimum in that direction.

Once it's done that, it finds another direction and goes to the minimum in that second direction. The clever thing about the technique is that it chooses the second direction in such a way that it doesn't mess up the minimization it already did in the first direction. That's called a conjugate direction. Conjugate means that as you go in the new direction, you don't change the gradients in the previous directions. It's a funny idea. It's like the idea of a twist in an error surface: a twist means that when you go in one direction, you change the gradient in another direction. A conjugate direction is one you can go in that, in a sense, doesn't have a twist: you go in that direction and the gradient in the first direction doesn't change.

So here is a picture of an ellipse, and the red line is the major axis of the ellipse. We start off by doing one step of steepest descent, all the way to the minimum in that direction. If you think about it a bit, you can see that that minimum won't actually lie on the red line. On the red line, the gradient is zero at right angles to the red line, because it's the bottom of the ravine. But the direction we're going in isn't actually at right angles to the red line at that point. We could make a little bit more progress by taking a small step at right angles to the red line and then a small step along the red line. Since the red line slopes down towards the middle of the ellipse, that's going to make some progress for us. So when we minimize in the first direction, we'll end up slightly across the bottom of the ellipse.

When we reach that point that's a minimum, there's an interesting property of all the points that lie on the green line: on that green line, the gradient in the direction of the black arrow is zero. So we can go anywhere along that green line and we won't destroy the fact that we are at a minimum in the direction of the black arrow.
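Here is what that direction-by-direction procedure looks like in its standard textbook form for a quadratic. This is a sketch only: the conjugate gradient used inside HF adds damping, preconditioning and careful stopping rules that aren't shown, and apply_H is a placeholder for whatever supplies curvature-matrix-times-vector products.

```python
import numpy as np

def conjugate_gradient(apply_H, g, steps):
    """Approximately minimize the quadratic 0.5 * d^T H d + g^T d.

    apply_H(v) must return H @ v for a symmetric positive-definite H;
    the matrix itself is never formed explicitly.
    """
    d = np.zeros_like(g)   # current point
    r = -g.copy()          # residual = negative gradient of the quadratic at d
    p = r.copy()           # first direction: steepest descent
    for _ in range(steps):
        Hp = apply_H(p)
        alpha = (r @ r) / (p @ Hp)   # exact minimizer along direction p
        d = d + alpha * p            # go all the way to the minimum in that direction
        r_new = r - alpha * Hp
        # The next direction is chosen to be conjugate to the previous ones:
        # moving along it leaves the gradient in those directions unchanged.
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d
```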
If we can keep doing that for many directions in a high-dimensional error surface, we'll eventually be at a minimum in many different directions. And if we are at a minimum in as many different directions as there are dimensions in the space, we'll be at the global minimum.

So we take this first step of steepest descent. We then figure out, and I'm not going to explain how we do that, the direction of that green line, and then we do a search along the green line to find how far we should go in order to minimize the error along it. And we take our second step, like this. Now, in this two-dimensional space, we'll be at the minimum: we're at a minimum in the direction of the first step, and we're now at a minimum in the direction of the second step while still being at a minimum in the direction of the first step, so that must be the global minimum.

What conjugate gradient achieves is that it gets to the global minimum of an N-dimensional quadratic surface in only N steps. It's very efficient. It does that because it manages to get the gradient to be zero in N different directions. They're not orthogonal directions, but they are independent of one another, and that is sufficient to be at the global minimum. More importantly, in many fewer than N steps on a typical quadratic surface, it will have reduced the error to very close to its minimum value, and that's why we use it. We're not going to do the full N steps; that would be as expensive as inverting the whole matrix. We're going to do many fewer than N steps and get quite close to the minimum.

You can apply conjugate gradient directly to a non-quadratic error surface, like the error surface for a multilayer non-linear neural net, and it usually works quite well. It's essentially a batch method, but you can apply it to large mini-batches: you do many steps of conjugate gradient on the same large mini-batch and then you move on to the next large mini-batch. That's called non-linear conjugate gradient.

The Hessian-Free optimizer uses conjugate gradient for minimization on a genuinely quadratic surface, and that's what conjugate gradient is best at. It works much better for that than for a non-linear surface.
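To check the "N steps for an N-dimensional quadratic" claim numerically, here is the sketch above run on a two-dimensional elliptical quadratic. The matrix and vector are made-up numbers, and this continues the hypothetical conjugate_gradient function from the previous block (with numpy already imported).

```python
# Two conjugate-gradient steps reach the exact minimum of a 2-D quadratic,
# matching the "N steps for an N-dimensional surface" claim.
H = np.array([[5.0, 2.0],
              [2.0, 1.0]])
g = np.array([1.0, -1.0])

d = conjugate_gradient(lambda v: H @ v, g, steps=2)
print(np.allclose(H @ d, -g))   # True: the gradient of the quadratic is zero here
```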
The genuinely quadratic surface that HF is using it for is the quadratic approximation to the true error surface made by the Hessian-Free method. So it makes that approximation, it uses conjugate gradient to get close to a minimum of that first approximation, and then it makes a new approximation to the curvature and does it again.
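Put together, that outer loop might look roughly like the sketch below. It's only a sketch: gradient and curvature_vector_product are placeholder functions that the model would have to provide (real implementations get the curvature-vector products without ever forming the matrix), it reuses the conjugate_gradient routine above, and the damping, the re-use of the previous CG solution, and the mini-batching of a real HF implementation are all left out.

```python
def hessian_free(w, gradient, curvature_vector_product,
                 outer_iters=100, cg_steps=50):
    """Rough sketch of the HF outer loop: make a quadratic approximation,
    run a truncated conjugate-gradient minimization of it, update the
    weights, then re-approximate the curvature and repeat."""
    for _ in range(outer_iters):
        g = gradient(w)
        # Curvature of the fresh quadratic approximation around w, exposed
        # only through matrix-vector products G @ v.
        apply_G = lambda v: curvature_vector_product(w, v)
        delta = conjugate_gradient(apply_G, g, steps=cg_steps)
        w = w + delta
    return w
```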