In this video, we're going to look at stochastic gradient descent learning for a neural network, particularly the mini-batch version, which is probably the most widely used learning algorithm for large neural networks.

We've seen this before, but let's start with a reminder about what the error surface looks like for a linear neuron. The error surface is a surface that lies in a space where the horizontal axes correspond to the weights of the neural net and the vertical axis corresponds to the error it makes. For a linear neuron with a squared error, that surface always forms a quadratic bowl. The vertical cross-sections are parabolas, and the horizontal cross-sections are ellipses. For multilayer non-linear nets the error surface is much more complicated, but as long as the weights aren't too big it's a smooth surface, and locally it's well approximated by a piece of a quadratic bowl. It might not be the bottom of the bowl, but there's a piece of a quadratic bowl that fits the local error surface very well.

If we look at the convergence speed when we do full-batch learning on a quadratic bowl, the obvious thing to do is go downhill; that will reduce the error. But the problem is that the direction of steepest descent does not point to the place we want to go. As you can see in the ellipse, the direction of steepest descent is almost at right angles to the direction we want to go in. The gradient is very big across the ellipse, which is the direction in which we only want to travel a small distance, and the gradient is very small along the ellipse, which is the direction in which we want to travel a large distance. It's precisely the wrong way around.

Now you might think that studying linear systems like this is not a good idea if you want to optimize big non-linear nets. But even for these non-linear multilayer nets, a very similar problem arises. Even though the error surfaces aren't globally quadratic bowls, locally they have the same kind of properties: they tend to be very curved in some directions and very uncurved in other directions.

So the way learning goes wrong if you use a big learning rate is that you slosh to and fro in the directions in which the error surface is very curved. We'll call that sloshing across a ravine.
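To make that picture concrete, here is a small NumPy sketch, purely illustrative and not from the lecture, of plain gradient descent on an elongated quadratic bowl. The two curvatures and the learning rate are made-up numbers chosen to exaggerate the effect: the weight in the steep direction flips sign on every step and shrinks only slowly, while the weight in the shallow direction barely moves.

```python
import numpy as np

# A toy elongated quadratic bowl, E(w) = 0.5 * (100*w1**2 + 1*w2**2).
# The curvatures (100 and 1) and the learning rate are made-up numbers,
# chosen only to exaggerate the ravine; they are not from the lecture.
curvature = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])     # start away from the minimum at (0, 0)
eps = 0.018                  # close to the divergence limit of 2/100 in the steep direction

for step in range(10):
    grad = curvature * w     # gradient of the quadratic error
    w = w - eps * grad
    # w[0] (steep direction) flips sign every step and shrinks slowly:
    # sloshing across the ravine. w[1] (shallow direction) barely moves.
    print(step, w)
```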
And with the learning rate too big, you'll actually diverge. What we want to achieve is to go quickly along the ravine in directions that have small but very consistent gradients, and to move slowly in directions with big but very inconsistent gradients, that is, directions in which, if you go a short distance, the gradient will reverse sign.

Before we go into how we achieve that, I need to talk a little bit about stochastic gradient descent and the motivation for using it. If you have a data set that's highly redundant, then if you compute the gradient for a weight on the first half of the data set, you'll get almost exactly the same answer as you get if you compute the gradient on the second half. So it's a complete waste of time to compute the gradient on the whole data set. You'd be much better off computing the gradient on a subset of the data, updating the weights, and then, on the remaining data, computing the gradient for the updated weights. We can take that to extremes and say we're going to compute the gradient on a single training case, update the weights, and then compute the gradient on the next training case using those new weights. That's called online learning.

In general, we don't want to go quite that far. It's usually better to use small mini-batches, typically ten, 100, or even 1,000 examples. One advantage of a small mini-batch is that less computation is used for actually updating the weights, because you do that less often than with online learning. Another advantage is that you can compute the gradient for a whole bunch of cases in parallel. Most computers are very good at matrix-matrix multiplies, and that allows you to apply the weights to a whole bunch of training cases at the same time to figure out the activities going into the next layer for all of those training cases. That gives you a matrix-matrix multiply, and it's very efficient, especially on a graphics processing unit.
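As a concrete illustration, here is a minimal NumPy sketch of one mini-batch update for a single linear layer with squared error. The batch size, layer sizes, learning rate, and random data are illustrative assumptions rather than values from the lecture; the point is that the forward pass for the whole mini-batch is a single matrix-matrix multiply.

```python
import numpy as np

# A minimal sketch of one mini-batch update for a single linear layer with
# squared error. The batch size, layer sizes, learning rate, and random data
# are illustrative assumptions, not values from the lecture.
rng = np.random.default_rng(0)

batch_size, n_in, n_out = 100, 50, 10
X = rng.standard_normal((batch_size, n_in))    # one mini-batch of inputs
T = rng.standard_normal((batch_size, n_out))   # the corresponding targets
W = 0.01 * rng.standard_normal((n_in, n_out))  # weights
eps = 0.01                                     # guessed initial learning rate

# Forward pass for the whole mini-batch at once: a single matrix-matrix
# multiply, which is what makes mini-batches so efficient on a GPU.
Y = X @ W

# Squared error on this mini-batch and the corresponding gradient estimate.
E = 0.5 * np.mean(np.sum((Y - T) ** 2, axis=1))
grad_W = X.T @ (Y - T) / batch_size

# One weight update using the rough gradient estimate from this mini-batch.
W -= eps * grad_W
print("mini-batch squared error:", E)
```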
One point about using mini-batches: you wouldn't want a mini-batch in which the answer is always the same, and then on the next mini-batch a different answer that's always the same. That would cause the weights to slosh around unnecessarily. The ideal, if you have say ten classes, would be a mini-batch of say ten or 100 examples that has exactly the same number of examples from each class. One way to approximate that is simply to take all your data, put it in random order, and grab random mini-batches. But you must avoid mini-batches that are very uncharacteristic of the whole data set, for example mini-batches that are all of one class.

So basically there are two types of learning algorithms for neural nets. There are full-gradient algorithms, where you compute the gradient from all of the training cases, and once you've done that, there are a lot of clever ways to speed up learning, things like non-linear versions of a method called conjugate gradient. The optimization community has been studying the general problem of how to optimize smooth non-linear functions for many years. Multilayer neural networks are pretty untypical of the kinds of problems they study, so applying the methods they developed may need a lot of modification to make them work for multilayer neural networks. But when you have highly redundant, large training sets, it's nearly always better to use mini-batch learning. The mini-batches may need to be quite big, but that's not so bad, because big mini-batches are more computationally efficient.

I'm now going to describe a basic mini-batch gradient descent learning algorithm. This is what most people would use when they start training a big neural net on a big redundant data set. You start by guessing an initial learning rate, and you look to see whether the network learns satisfactorily, or whether the error keeps getting worse or oscillates wildly. If that happens, you reduce the learning rate. You also look to see if the error is falling too slowly. You expect the error to fluctuate a bit if you measure it on a validation set, because the gradient on a mini-batch is just a rough estimate of the overall gradient, so you don't want to reduce the learning rate every time the error rises. What you're hoping is that the error will fall fairly consistently. And if it is falling fairly consistently but very slowly, you can probably increase the learning rate. Once you've got that working, you can write a simple program to automate that way of adjusting the learning rate.
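Here is a rough sketch, under my own assumptions, of what that simple automated adjustment might look like. The 5% tolerance, the "very slowly" threshold, and the 0.5 and 1.1 factors are made-up values; the lecture gives the heuristic, not these numbers.

```python
# A rough sketch of automating the learning-rate heuristic described above.
# The 5% tolerance, the "very slowly" threshold, and the 0.5 / 1.1 factors
# are made-up values; the lecture gives the heuristic, not these numbers.
def adjust_learning_rate(val_errors, eps):
    """Return a new learning rate given the history of validation errors."""
    if len(val_errors) < 2:
        return eps
    # Don't react to every small rise (mini-batch gradients are noisy), but
    # if the error is clearly getting worse or oscillating wildly, reduce.
    if val_errors[-1] > 1.05 * min(val_errors):
        return eps * 0.5
    # If the error is still falling consistently, but only very slowly, increase.
    drop = val_errors[-2] - val_errors[-1]
    if 0 < drop < 1e-3 * val_errors[-2]:
        return eps * 1.1
    return eps

# Example: the error is falling consistently but very slowly, so the rate grows.
print(adjust_learning_rate([1.0, 0.9999, 0.99985], eps=0.01))  # roughly 0.011
```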
One thing that nearly always helps towards the end of learning with mini-batches is to turn down the learning rate. That's because you're going to get fluctuations in the weights caused by the fluctuations in the gradients coming from the mini-batches, and you'd like a final set of weights that's a good compromise. When you turn down the learning rate, you smooth away those fluctuations and get a final set of weights that's good for many mini-batches.

A good time to turn down the learning rate is when the error stops decreasing consistently, and a good criterion for deciding that the error has stopped decreasing is the error on a separate validation set. That is, a bunch of examples that you're not using for training and that also won't be used for your final test.
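A minimal sketch of how that end-of-training step might be coded, assuming a held-out validation set as described. The split fractions, the patience of three epochs, and the decay factor are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# A minimal sketch, under assumed split fractions, of holding out a validation
# set (not used for training or the final test) and turning the learning rate
# down once the validation error stops decreasing consistently.
def split_data(X, T, rng, frac_valid=0.1, frac_test=0.1):
    idx = rng.permutation(len(X))
    n_valid, n_test = int(frac_valid * len(X)), int(frac_test * len(X))
    valid, test, train = idx[:n_valid], idx[n_valid:n_valid + n_test], idx[n_valid + n_test:]
    return (X[train], T[train]), (X[valid], T[valid]), (X[test], T[test])

def maybe_turn_down(eps, val_errors, patience=3, factor=0.1):
    # The patience of 3 epochs and the factor of 0.1 are made-up values.
    # Apply this once, near the end of learning, to smooth away the
    # weight fluctuations caused by noisy mini-batch gradients.
    if len(val_errors) > patience and min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        return eps * factor
    return eps

# Example: no new minimum in the last three epochs, so the rate is turned down.
print(maybe_turn_down(0.01, [0.90, 0.50, 0.40, 0.41, 0.42, 0.40]))  # roughly 0.001
```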