In this video, I'm first going to introduce a method called rprop, which is used for full-batch learning. It's like Robbie Jacobs' method, but not quite the same. I'm then going to show how to extend rprop so that it works for mini-batches. This gives you the advantages of rprop, and it also gives you the advantage of mini-batch learning, which is essential for large, redundant data sets. The method that we end up with, called rmsprop, is currently my favorite basic method for learning the weights in a large neural network with a large, redundant data set.

I'm now going to describe rprop, which is an interesting way of trying to deal with the fact that gradients vary widely in their magnitudes. Some gradients can be tiny and others can be huge, and that makes it hard to choose a single global learning rate. If we're doing full-batch learning, we can cope with these big variations in gradients by just using the sign of the gradient. That makes all of the weight updates the same size. For issues like escaping from plateaus with very small gradients, this is a great technique, because even with tiny gradients we'll take quite big steps. We couldn't achieve that by just turning up the learning rate, because then the steps we took for weights that had big gradients would be much too big.

Rprop combines the idea of just using the sign of the gradient with the idea of making the step size depend on which weight it is. So to decide how much to change a weight, you don't look at the magnitude of the gradient; you just look at its sign. But you do look at the step size you decided on for that weight, and that step size adapts over time, again without looking at the magnitude of the gradient. We increase the step size for a weight multiplicatively, for example by a factor of 1.2, if the signs of the last two gradients agree. This is like Robbie Jacobs' adaptive learning rate method, except that here we do a multiplicative increase. If the signs of the last two gradients disagree, we decrease the step size multiplicatively, and we make that decrease more powerful than the increase, so that step sizes can shrink faster than they grow. We also need to limit the step sizes. Mike Schuster's advice was to limit them between 50 and a millionth. I think it depends a lot on what problem you're dealing with. If, for example, you have a problem with some tiny inputs, you might need very big weights on those inputs for them to have an effect. I suspect that if you're not dealing with that kind of problem, having an upper limit on the weight changes that's much less than 50 would be a good idea.
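Here is a minimal sketch, in Python with NumPy, of the per-weight update just described. The increase factor of 1.2 and the limits of 50 and a millionth come from the lecture; the decrease factor of 0.5 is a common choice that the lecture doesn't specify, and this version leaves out the weight-backtracking refinements that some rprop variants add.

```python
import numpy as np

def rprop_update(weights, grad, prev_grad, step_sizes,
                 inc=1.2, dec=0.5, max_step=50.0, min_step=1e-6):
    """One full-batch rprop update.

    Each weight keeps its own step size. The step size grows multiplicatively
    when the current gradient has the same sign as the previous one, and
    shrinks (faster than it grows) when the sign flips. Only the sign of the
    current gradient decides the direction of the weight change.
    """
    agreement = np.sign(grad) * np.sign(prev_grad)
    step_sizes = np.where(agreement > 0, step_sizes * inc, step_sizes)
    step_sizes = np.where(agreement < 0, step_sizes * dec, step_sizes)
    step_sizes = np.clip(step_sizes, min_step, max_step)  # limit the step sizes
    weights = weights - np.sign(grad) * step_sizes        # move by the step size, in the gradient's sign direction
    return weights, step_sizes
```

You would call this once per full-batch gradient computation, carrying prev_grad and step_sizes over from one call to the next.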
So one question is, why doesn't rprop work with mini-batches? People have tried it, and find it hard to get it to work. You can get it to work with very big mini-batches, where you use much more conservative changes to the step sizes, but it's difficult. The reason it doesn't work is that it violates the central idea behind stochastic gradient descent, which is that when we have a small learning rate, the gradient gets effectively averaged over successive mini-batches. So consider a weight that gets a gradient of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch. What we'd like is for those gradients to roughly average out (nine times 0.01 minus 0.09 is zero), so that the weight stays where it is. Rprop won't give us that. Rprop would increment the weight nine times by whatever its current step size is, and decrement it only once, and that would make the weight get much bigger. We're assuming here that the step sizes adapt much more slowly than the time scale of these mini-batches.

So the question is, can we combine the robustness you get from rprop by just using the sign of the gradient, the efficiency you get from mini-batches, and the averaging of gradients over mini-batches, which is what allows mini-batches to combine gradients in the right way? That leads to a method which I'm calling rmsprop, and which you can consider to be a mini-batch version of rprop. Rprop is equivalent to using the gradient but also dividing by the magnitude of the gradient, and the reason it has problems with mini-batches is that we divide the gradient by a different magnitude for each mini-batch. So the idea is that we're going to force the number we divide by to be pretty much the same for nearby mini-batches. We do that by keeping a moving average of the squared gradient for each weight. So MeanSquare(w, t) means this moving average for weight w at time t, where time is an index over weight updates; time increments by one each time we update the weights. The numbers I've put in of 0.9 and 0.1 for computing the moving average are just examples, but they're reasonably sensible examples. So the mean square is the previous mean square times 0.9, plus the squared gradient for that weight at time t, times 0.1. We then take that mean square, take its square root, which is why the method has RMS in its name, and divide the gradient by that RMS, and make an update proportional to that. That makes the learning work much better. Notice that we're not adapting the learning rate separately for each connection here. This is a simpler method, where for each connection we simply keep a running average of the root mean square gradient and divide by that.
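Here is a minimal sketch of that rmsprop update, again in Python with NumPy. The decay factors of 0.9 and 0.1 are the ones used in the lecture; the particular global learning rate value and the small epsilon that guards against dividing by zero are my own additions.

```python
import numpy as np

def rmsprop_update(weights, grad, mean_square,
                   learning_rate=0.01, decay=0.9, eps=1e-8):
    """One mini-batch rmsprop update.

    mean_square is a per-weight moving average of the squared gradient:
        MeanSquare(w, t) = 0.9 * MeanSquare(w, t-1) + 0.1 * grad(w, t)**2
    Dividing the gradient by the square root of this average (the RMS) keeps
    the divisor roughly the same for nearby mini-batches.
    """
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    weights = weights - learning_rate * grad / (np.sqrt(mean_square) + eps)
    return weights, mean_square
```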
There are many further developments one could make to rmsprop. You could combine it with standard momentum; my experiments so far suggest that doesn't help as much as momentum normally does, and that needs more investigation. You could combine rmsprop with Nesterov momentum, where you first make the jump and then make a correction, and Ilya Sutskever has tried that recently and got good results. He's discovered that it works best if the RMS of the recent gradients is used to divide the correction term, rather than the large jump you make in the direction of the accumulated corrections. Obviously, you could also combine rmsprop with adaptive learning rates on each connection, which would make it much more like rprop. That just needs a lot more investigation; I just don't know at present how helpful it will be. And then there are a bunch of other methods related to rmsprop that have a lot in common with it. Yann LeCun's group has an interesting paper called No More Pesky Learning Rates that came out this year, and some of the terms in that look like rmsprop, but it has many other terms. I suspect, at present, that most of the advantage of this complicated method recommended by Yann LeCun's group comes from the fact that it's similar to rmsprop, but I don't really know that.
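The Nesterov-style combination is only described in outline here, so the sketch below is just one plausible reading of it, not necessarily the formulation Sutskever used: make the jump along the accumulated velocity first, then divide only the gradient correction by the RMS of recent gradients. The grad_fn argument and the particular learning rate and momentum values are illustrative assumptions.

```python
import numpy as np

def nesterov_rmsprop_update(weights, velocity, mean_square, grad_fn,
                            learning_rate=0.001, momentum=0.9,
                            decay=0.9, eps=1e-8):
    """One step of rmsprop combined with Nesterov momentum (one possible reading).

    The big jump in the direction of the accumulated velocity is left alone;
    only the correction term, computed from the gradient at the look-ahead
    point, is divided by the RMS of recent gradients.
    """
    lookahead = weights + momentum * velocity          # first make the jump
    grad = grad_fn(lookahead)                          # gradient where we ended up
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    correction = -learning_rate * grad / (np.sqrt(mean_square) + eps)
    velocity = momentum * velocity + correction        # accumulate jump plus correction
    weights = weights + velocity
    return weights, velocity, mean_square
```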
So, a summary of the learning methods for neural networks goes like this. If you've got a small data set, say 10,000 cases or less, or a big data set without much redundancy, you should consider using a full-batch method. That means full-batch methods adapted from the optimization literature, like non-linear conjugate gradient, LBFGS, or Levenberg-Marquardt. One advantage of using those methods is that they typically come with a package, and when you report the results in your paper you just have to say, I used this package and here's what it did; you don't have to justify all sorts of little decisions. Alternatively, you could use the adaptive learning rates I described in another video, or rprop, which are both essentially full-batch methods, but they are methods that were developed for neural networks.

If you have a big, redundant data set, it's essential to use mini-batches. It's a huge waste not to do that. The first thing to try is just standard gradient descent with momentum. You're going to have to choose a global learning rate, and you might want to write a little loop to adapt that global learning rate based on whether the gradient has changed sign. But to begin with, don't go for anything as fancy as adapting individual learning rates for individual weights. The next thing to try is rmsprop. That's very simple to implement if you do it without momentum, and in my experiments so far, it seems to work as well as gradient descent with momentum, or better. You can also consider all sorts of ways of improving rmsprop by adding momentum or adaptive step sizes for each weight, but that's still basically uncharted territory. Finally, you could find out whatever Yann LeCun's latest recipe is and try that. He's probably the person who has tried the most different ways of getting stochastic gradient descent to work well, so it's worth keeping up with whatever he's doing.
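For completeness, here is a minimal sketch of that first thing to try: mini-batch gradient descent with momentum, plus a crude version of the little loop that adapts the global learning rate. The lecture doesn't say what the adaptation rule should be, so the dot-product test and the factors 1.05 and 0.7 below are purely my own guess at one such rule.

```python
import numpy as np

def sgd_momentum_step(weights, velocity, grad, learning_rate, momentum=0.9):
    """One mini-batch step of standard gradient descent with momentum."""
    velocity = momentum * velocity - learning_rate * grad
    weights = weights + velocity
    return weights, velocity

def adapt_global_rate(learning_rate, grad, prev_grad,
                      up=1.05, down=0.7, min_rate=1e-6, max_rate=1.0):
    """Nudge the single global learning rate up when successive gradients
    mostly agree, and cut it back when they mostly change sign."""
    agreement = float(np.sum(grad * prev_grad))
    factor = up if agreement > 0 else down
    return float(np.clip(learning_rate * factor, min_rate, max_rate))
```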
One question you might ask is why there is no simple recipe. We've been messing around with neural nets, including deep neural nets, for more than 25 years now, and you would think we would have come up with an agreed way of doing the learning. There are really two reasons, I think, why there isn't a simple recipe.

First, neural nets differ a lot. Very deep networks, especially ones that have narrow bottlenecks in them, which I'll come to in later lectures, are very hard things to optimize, and they need methods that can be very sensitive to very small gradients. Recurrent nets are another special case; they're typically very hard to optimize if you want them to notice things that happened a long time in the past and change the weights based on those things. Then there are wide, shallow networks, which are quite different in flavor and are used a lot in practice. They often can be optimized with methods that are not very accurate, because we stop the optimization early, before it starts overfitting. So for these different kinds of networks, there are very different methods that are probably appropriate.

The other consideration is that tasks differ a lot. Some tasks require very accurate weights, and some tasks don't require the weights to be very accurate at all. Also, there are some tasks with weird properties: if your inputs are words, for example, rare words may only occur in one case in a hundred thousand. That's a very, very different flavor from what happens if your inputs are pixels.

So, to summarize, we really don't have nice, clear-cut advice for how to train a neural net. We have a bunch of rules of thumb. It's not entirely satisfactory, but just think how much better neural nets will work once we've got this sorted out; and they already work pretty well.