In this video, we're going to look at a method that was developed in the late 1980s by Robbie Jacobs and then improved by a number of other people. The idea is that each connection in the neural net should have its own adaptive learning rate, which we set empirically by observing what happens to the weight on that connection when we update it. If the gradient for the weight keeps reversing sign, we turn down the learning rate; if the gradient stays consistent, we turn it up.

So, let's start by thinking about why having a separate adaptive learning rate for each connection is a good idea. The problem is that, in a deep multilayer net, the appropriate learning rates can vary widely between different weights, especially between weights in different layers. If, for example, we start with small weights, the gradients tend to be much smaller in the early layers than in the later layers. Another factor that makes different learning rates appropriate for different weights is the fan-in of the unit. The fan-in determines the size of the overshoot effects you get when you simultaneously change many of the incoming weights to fix up the same error: it may be that the unit didn't get enough input, but when you change all those weights at the same time to fix up the error, it now gets too much input. Obviously, that effect is going to be bigger if there's a bigger fan-in. The net in the diagram on the right has more or less the same fan-in for both layers, but that's very different in some nets.

So, the idea is that we're going to use a global learning rate, which we set by hand, and then multiply it by a local gain that is determined empirically for each weight. A simple way to determine what those local gains should be is to start with a local gain of one for every weight. So, initially, we change the weight w_ij by the learning rate times the gain g_ij of one times the error derivative for that weight. Then we adapt g_ij: we increase g_ij if the gradient for the weight does not change sign, and we use small additive increases and multiplicative decreases. If the gradient for the weight at time t has the same sign as the gradient at time t minus one, where t refers to weight updates, then their product will be positive, because you've either got two negative gradients or two positive gradients, and we increase g_ij by a small additive amount. If the gradients have opposite signs, we decrease g_ij. And because we want to damp down g_ij quickly if it's already big, we decrease it multiplicatively. That ensures that big gains will decay very rapidly if oscillations start.

It's interesting to ask what would happen if the gradient were totally random, so that on each weight update we picked a random gradient. Then you'd get an equal number of increases and decreases, because the new gradient would equally often have the same sign as the previous gradient or the opposite sign. So you'd get a bunch of additive 0.05 increases and multiplicative 0.95 decreases, and those have an equilibrium point at a gain of one. If the gain is bigger than one, multiplying by 0.95 reduces it by more than adding 0.05 increases it; if the gain is smaller than one, adding 0.05 increases it by more than multiplying by 0.95 decreases it. In other words, the expected change per update is 0.5 × 0.05 − 0.5 × 0.05g, which is zero exactly when g equals one. So, with random gradients, the gain will hover around one. And if the gradient is consistently in the same direction, the gain can get much bigger than one; if it's consistently in opposite directions, which means we're oscillating across a ravine, the gain can get much smaller than one.
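To make that rule concrete, here is a minimal NumPy sketch of one full-batch update with per-weight gains, using the 0.05 additive increase and 0.95 multiplicative decrease from the lecture. The function name, the learning rate value, the flat-array representation of the weights, and the `compute_gradient` helper are all illustrative assumptions, not part of the method's original specification.

```python
import numpy as np

def adaptive_gain_step(w, grad, prev_grad, gains, lr=0.01):
    """One full-batch weight update with a separate adaptive gain per weight.

    Gains start at one. If a weight's gradient has the same sign as on the
    previous update, its gain grows by a small additive amount (+0.05);
    if the sign flips, the gain shrinks multiplicatively (*0.95), so big
    gains decay rapidly once oscillations start.
    """
    same_sign = grad * prev_grad > 0             # elementwise sign agreement
    gains = np.where(same_sign, gains + 0.05, gains * 0.95)
    w = w - lr * gains * grad                    # effective rate = lr * gain
    return w, gains

# Usage sketch:
#   gains = np.ones_like(w); prev_grad = np.zeros_like(w)
#   then on each full-batch update:
#       grad = compute_gradient(w)               # hypothetical gradient function
#       w, gains = adaptive_gain_step(w, grad, prev_grad, gains)
#       prev_grad = grad
```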
There are a number of tricks for making these adaptive learning rates work better. It's important to limit the size of the gains: a reasonable range is 0.1 to 10, or 0.1 to 100. You don't want the gains to get huge, because then you can easily run into instability, the gains won't die down fast enough, and you'll destroy all the weights.

Adaptive learning rates were designed for full-batch learning. You can also apply them with mini-batches, but they had better be pretty big mini-batches. That ensures the sign changes in the gradient aren't due to the sampling error of a mini-batch, but really reflect crossing to the other side of the ravine.

There's also nothing to prevent you from combining adaptive learning rates with momentum. Jacobs suggests that, instead of using the agreement in sign between the current gradient and the previous gradient, you use the agreement in sign between the current gradient and the velocity for that weight, that is, the accumulated gradient. If you do that, you get a nice combination of the advantages of momentum and the advantages of adaptive learning rates: adaptive learning rates only deal with axis-aligned effects, whereas momentum doesn't care about the alignment of the axes. Momentum can deal with diagonal ellipses, moving quickly in that diagonal direction, which adaptive learning rates can't do.
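Here is a hedged sketch of those last two tricks combined. It assumes one particular reading of the lecture: that "velocity" means a decaying accumulation of past gradients, that agreement is checked against the velocity before it is updated, and that gains are clipped to the 0.1-to-10 range mentioned above. The function name and hyperparameter values are illustrative.

```python
import numpy as np

def adaptive_gain_momentum_step(w, grad, velocity, gains, lr=0.01,
                                momentum=0.9, gain_min=0.1, gain_max=10.0):
    """Per-weight gains adapted by comparing the current gradient's sign
    with the sign of the velocity (the accumulated gradient), as Jacobs
    suggests, with the gains clipped to a limited range for stability.
    """
    agree = grad * velocity > 0                  # gradient agrees with accumulated gradient?
    gains = np.where(agree, gains + 0.05, gains * 0.95)
    gains = np.clip(gains, gain_min, gain_max)   # limit the size of the gains
    velocity = momentum * velocity + grad        # accumulate the gradient
    w = w - lr * gains * velocity                # momentum step scaled per weight
    return w, velocity, gains
```

Checking agreement against the velocity rather than the previous gradient means a weight's gain keeps growing as long as the weight makes consistent progress in one direction, even if individual gradients are noisy.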