In this video, we're going to look at a method that was developed in the late 1980s by Robbie Jacobs and then improved by a number of other people. The idea is that each connection in the neural net should have its own adaptive learning rate, which we set empirically by observing what happens to the weight on that connection when we update it. So if the gradient on that connection keeps reversing sign, we turn down the learning rate, and if the gradient stays consistent, we turn up the learning rate.

Let's start by thinking about why having a separate adaptive learning rate on each connection is a good idea. The problem is that in a deep multilayer net, the appropriate learning rates can vary widely between different weights, especially between weights in different layers. If, for example, we start with small weights, the gradients are often much smaller in the early layers than in the later layers.

Another factor that calls for different learning rates for different weights is the fan-in of a unit. The fan-in determines the size of the overshoot effects you get when you simultaneously change many of the different incoming weights to fix up the same error. It may be that the unit didn't get enough input, and when you change all of those weights at the same time to fix up the error, it now gets too much input. Obviously, that effect is going to be bigger if there's a bigger fan-in. The net in the diagram on the right has more or less the same fan-in for both layers, but fan-ins can be very different in other nets.

So the idea is that we're going to use a global learning rate, which we set by hand, and then multiply it by a local gain that is determined empirically for each weight. A simple way to determine those local gains is to start with a local gain of one for every weight. So initially we change the weight w_ij by the learning rate times the gain g_ij of one times the error derivative for that weight. Then we adapt g_ij: we increase g_ij if the gradient for the weight does not change sign, using small additive increases and multiplicative decreases. So if the gradient for the weight at time t has the same sign as the gradient at time t-1, where t refers to weight updates, their product will be positive, because you either get two negative gradients or two positive gradients, and then we increase g_ij by a small additive amount. If the gradients have opposite signs, we decrease g_ij, and because we want to damp g_ij down quickly if it's already big, we decrease it multiplicatively. That ensures that big gains decay very rapidly if oscillations start.
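To make that concrete, here is a minimal sketch in Python of one such update, assuming a plain full-batch setting; the 0.05 and 0.95 constants, the toy error function, and the array names are illustrative choices rather than anything prescribed in the lecture.

```python
import numpy as np

def adaptive_gain_step(w, grad, prev_grad, gains, eps=0.01, add=0.05, mult=0.95):
    """One full-batch weight update with a separate adaptive gain per weight.

    eps is the hand-set global learning rate.  gains holds one local gain per
    weight and starts at one; add and mult are the small additive increase and
    the multiplicative decrease applied to the gains.
    """
    same_sign = grad * prev_grad > 0          # did the gradient keep its sign?
    gains = np.where(same_sign,
                     gains + add,             # consistent sign: small additive increase
                     gains * mult)            # sign reversal: multiplicative decrease
    w = w - eps * gains * grad                # delta w_ij = -eps * g_ij * dE/dw_ij
    return w, gains

# Toy usage on the error E = sum(w**2), whose gradient is 2*w.
w = np.random.randn(5)
gains = np.ones_like(w)                       # every local gain starts at one
prev_grad = 2 * w                             # seed so the first sign comparison is defined
for step in range(50):
    grad = 2 * w
    w, gains = adaptive_gain_step(w, grad, prev_grad, gains)
    prev_grad = grad
```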
It's interesting to ask what would happen if the gradient were totally random, so on each weight update you pick a random gradient. Then you'll get an equal number of increases and decreases, because the gradient will equally often have the same sign as the previous gradient or the opposite sign. So you'll get a bunch of additive 0.05 increases and multiplicative 0.95 decreases, and those have an equilibrium point, which is when the gain is one. If the gain is bigger than one, multiplying by 0.95 reduces it by more than adding 0.05 increases it; if the gain is smaller than one, adding 0.05 increases it by more than multiplying by 0.95 decreases it. So with random gradients the gain will hover around one. If the gradient is consistently in the same direction, the gain can get much bigger than one, and if the gradient is consistently in opposite directions, which means we're oscillating across a ravine, it can get much smaller than one.

There are a number of tricks for making adaptive learning rates work better. It's important to limit the size of the gains; a reasonable range is 0.1 to 10, or 0.1 to 100. You don't want the gains to get huge, because then you can easily get into an instability, the gains won't die down fast enough, and you'll destroy all the weights.

The adaptive learning rates were designed for full-batch learning. You can also apply them with mini-batches, but they had better be pretty big mini-batches. That ensures that changes in the sign of the gradient aren't due to the sampling error of the mini-batches, but really due to hitting the other side of the ravine.

There's nothing to prevent you from combining adaptive learning rates with momentum. Jacobs suggests that, instead of using the agreement in sign between the current gradient and the previous gradient, you use the agreement in sign between the current gradient and the velocity for that weight, that is, the accumulated gradient.
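A rough sketch of that combination, in the same spirit as the code above: treating the velocity as a decayed running sum of gradients, and the particular constants and the [0.1, 10] clipping range, are assumptions made for illustration rather than Jacobs' exact recipe.

```python
import numpy as np

def gain_momentum_step(w, grad, velocity, gains, eps=0.01, alpha=0.9,
                       add=0.05, mult=0.95, g_min=0.1, g_max=10.0):
    """Per-weight adaptive gains combined with momentum.

    The gain for a weight is increased additively when the current gradient
    agrees in sign with that weight's velocity (the accumulated gradient) and
    decreased multiplicatively when they disagree.  Gains are clipped so they
    cannot grow without bound.
    """
    agree = grad * velocity > 0              # gradient agrees with the accumulated gradient?
    gains = np.where(agree, gains + add, gains * mult)
    gains = np.clip(gains, g_min, g_max)     # limit the size of the gains, e.g. 0.1 to 10
    velocity = alpha * velocity + grad       # accumulate the gradient (momentum)
    w = w - eps * gains * velocity           # step along the velocity, scaled per weight
    return w, velocity, gains

# Toy usage on the same quadratic error as before.
w = np.random.randn(5)
gains = np.ones_like(w)
velocity = np.zeros_like(w)
for step in range(50):
    grad = 2 * w
    w, velocity, gains = gain_momentum_step(w, grad, velocity, gains)
```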
And if you do that, you get a nice combination of the advantages of momentum and the advantages of adaptive learning rates. Adaptive learning rates only deal with axis-aligned effects, whereas momentum doesn't care about the alignment of the axes. Momentum can deal with diagonal ellipses and move quickly in that diagonal direction, which adaptive learning rates can't do.
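One way to make that contrast explicit (a formalisation added here, not stated in the lecture): the per-weight gains amount to multiplying the gradient by a diagonal matrix, which can only rescale the individual weight axes, whereas momentum's velocity accumulates the gradient itself and so can build up speed along any consistent direction, including a diagonal one.

```latex
% Adaptive per-weight gains: a diagonal rescaling of the gradient.
\Delta w = -\,\varepsilon \,\mathrm{diag}(g)\, \frac{\partial E}{\partial w}

% Momentum: the velocity accumulates the gradient and can point in any direction.
v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w},
\qquad \Delta w = v(t)
```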