In this video, we are going to look at a number of issues that arise when using stochastic gradient descent with mini-batches. There is a large number of tricks that make things work much better; these are the kind of black art of neural networks, and I'm going to go over some of the main tricks in this video.

The first issue I want to talk about is initializing the weights in your network. If two hidden units have exactly the same weights and the same bias, with the same incoming and outgoing connections, then they can never become different from one another, because they will always get exactly the same gradient. So, to allow them to learn different feature detectors, you need to start them off different from one another. We do this by initializing the weights to small random values, and that breaks the symmetry.

Those small random weights shouldn't all necessarily be the same size as each other. If a hidden unit has a very big fan-in, then quite big weights will tend to saturate it, so you can afford to use much smaller weights. If a hidden unit has a very small fan-in, you want to use bigger weights. And since the weights are random, the total input a unit receives scales with the square root of its fan-in, so a good principle is to make the size of the initial weights proportional to one over the square root of the fan-in. We can also scale the learning rates for the weights in the same way.
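As a rough illustration of that initialization principle (a sketch, not code from the lecture; the function name and the choice of a Gaussian are my assumptions):

    import numpy as np

    def init_weights(fan_in, fan_out, scale=1.0, seed=0):
        # Small random weights break the symmetry between hidden units.
        # Dividing by sqrt(fan_in) keeps the total input to a unit roughly
        # the same size no matter how many incoming connections it has.
        rng = np.random.default_rng(seed)
        return scale * rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)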
One thing that has a surprisingly big effect on the speed with which a neural network learns is shifting the inputs, that is, adding a constant to each component of the inputs. It seems surprising that that could make much difference, but when you're using steepest descent, shifting an input value by adding a constant can make a very big difference. It usually helps to shift each component of the input so that, averaged over all of the training data, it has a value of zero. That is, make sure its mean is zero.

So suppose we have a very simple case: just a linear neuron with two weights, and some training cases. The first training case says that when the inputs are 101 and 101, you should give an output of two. The second one says that when the inputs are 101 and 99, you should output zero. I'm using color here to indicate which training case I'm talking about.

If you look at the error surface you get for those two training cases, it looks like this. The green line is the line along which the weights will satisfy the first training case, and the red line is the line along which the weights will satisfy the second training case. What we notice is that they're almost parallel, so when you combine them, you get a very elongated ellipse. One way to think about what's going on here is that, because we're using a squared error measure, we get a parabolic trough along the red line: the red line is the bottom of a parabolic trough that tells us the squared error we'll be getting on the red case, and there's another parabolic trough with the green line along its bottom. And it turns out, although this may surprise your spatial intuition, that if you add together two parabolic troughs you get a quadratic bowl, an elongated quadratic bowl in this case. So that's where that error surface came from.

Now look what happens if we subtract a hundred from each of those two input components. We get a completely different error surface; in this case it's a circle, which is ideal. The green line is the line along which the weights add to two: we take the first weight and multiply it by one, we take the second weight and multiply it by one, and we need to get two, so the weights had better add to two. The red line is the line along which the two weights are equal, because we take the first weight and multiply it by one, we take the second weight and multiply it by minus one, and if the weights are equal we get the zero that we need. So the error surface in this case is a nice circle where gradient descent is really easy, and all we did was subtract 100 from every input.

If you're thinking about what happens not with the inputs but with the hidden units, it makes sense to have hidden units that are hyperbolic tangents, which go between minus one and one. A hyperbolic tangent is just a logistic that has been rescaled and shifted. The reason that makes sense is that the activities of the hidden units are then roughly zero mean, and that should make the learning faster in the next layer. Of course, that's only true if the inputs to the hyperbolic tangents are distributed sensibly around zero.
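To make that relationship precise (the exact formula is not spelled out in the lecture), the hyperbolic tangent is the logistic rescaled to the range from minus one to one, with its input doubled:

    \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1, \qquad \text{where } \sigma(z) = \frac{1}{1 + e^{-z}}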
But in that respect, a hyperbolic tangent is better than a logistic. However, there are other respects in which a logistic is better. For example, a logistic gives you a rug to sweep things under: it gives an output of zero, and if you make the input even more negative, the output is still zero, so fluctuations in big negative inputs are ignored by the logistic. For the hyperbolic tangent, you have to go out to the ends of its plateaus before it can ignore anything.

Another thing that makes a big difference is scaling the inputs. When we use steepest descent, scaling the input values is a very simple thing to do: we transform them so that each component of the input has unit variance over the whole training set, so that it has a typical value of one or minus one.

So again, take this simple net with two weights, and look at the error surface when the first input component is very small and the second component is much bigger. We get an error surface that is an ellipse with very high curvature along the weight whose input component is big, because small changes to that weight make a big difference to the output, and very low curvature along the weight whose input component is small, because small changes to that weight hardly make any difference to the error. The color here indicates which axis we're using, not which training example we're using, as it did in the previous slide. If we simply change the variance of the inputs, just rescale them so the first component is ten times as big and the second component is ten times as small, we again get a nice circular error surface.
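Here is a rough sketch of those two transformations together (my own code, not from the lecture; the function name is an assumption), shifting each input component to zero mean and scaling it to unit variance over the training set:

    import numpy as np

    def standardize(X):
        # X holds the training inputs, one row per case, one column per component.
        # Shift each component so its mean over the training set is zero,
        # then scale it so it has unit variance over the training set.
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        return (X - mean) / std, mean, std

The same mean and standard deviation would then be applied to any new data, so test inputs get transformed in exactly the same way as the training inputs.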
Shifting and scaling the inputs are very simple things to do. There is something a bit more complicated that actually works even better, because it's guaranteed to give you a circular error surface, at least for a linear neuron: we try to decorrelate the components of the input vectors. In other words, if you take two components and look at how they're correlated with one another over the whole training set (like the earlier example, where the number of portions of chips and the number of portions of ketchup might be highly correlated), we want to get rid of those correlations, and that will make learning much easier.

There are actually many ways to decorrelate things. For those of you who know about principal components analysis, a very sensible thing to do is to apply principal components analysis, remove the components that have the smallest eigenvalues, which already achieves some dimensionality reduction, and then scale the remaining components by dividing them by the square roots of their eigenvalues. For a linear system, that will give you a circular error surface. If you don't know about principal components, we'll cover it later in the course. Once you've got a circular error surface, the gradient points straight towards the minimum, so learning is really easy.
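A minimal sketch of that recipe, assuming the inputs have already been shifted to zero mean (the function name, the covariance estimator, and the small epsilon are my assumptions, not details from the lecture):

    import numpy as np

    def pca_whiten(X, k):
        # X: zero-mean training inputs, one row per case.
        # Project onto the k principal components with the largest eigenvalues,
        # then divide each projection by the square root of its eigenvalue so
        # the resulting components are decorrelated and have unit variance.
        cov = X.T @ X / X.shape[0]
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
        keep = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
        projected = X @ eigvecs[:, keep]
        return projected / np.sqrt(eigvals[keep] + 1e-8)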
Now let's talk about a few of the common problems that people encounter. One thing that can happen is that if you start with a learning rate that's much too big, you drive the hidden units to be either firmly on or firmly off; that is, their incoming weights become very big and positive or very big and negative, and their state no longer depends on the input. That means the error signal coming back from the output won't affect them, because they are out on the plateaus where the derivative is basically zero, and so learning will stop. Because people are expecting to see local minima, when learning stops they say, oh, I'm at a local minimum and the error is terrible, so there must be these really bad local minima. Usually that's not true; usually it's because you've got stuck out on the end of a plateau.

A second problem occurs if you are classifying things and you're using either a squared error or a cross-entropy error. The best guessing strategy is normally to make the output unit equal to the proportion of the time that it should be one. The network will fairly quickly find that strategy, so the error will fall quickly, but, particularly if the network has many layers, it may take a long time before it improves much on that. To improve over the guessing strategy it has to get sensible information from the input through all the hidden layers to the output, and that can take a long time if you start with small weights. So again, you learn quickly, then the error stops decreasing, and it looks like a local minimum, but actually it's another plateau.

I mentioned earlier that towards the end of learning you should turn down the learning rate. You should also be careful about turning down the learning rate too soon. When you turn down the learning rate, you reduce the random fluctuations in the error due to the different gradients on different mini-batches, but of course you also reduce the rate of learning. If you look at the red curve, you see that when we turn the learning rate down we get a quick win: the error falls, but after that we get slower learning. If we do that too soon, we're going to lose relative to the green curve. So don't turn down the learning rate too soon or by too much.

I'm now going to talk about four main ways to speed up mini-batch learning a lot. The previous things I talked about were kind of a bag of tricks for making things work better; these are four methods all explicitly designed to make the learning go much faster.

The first is the momentum method. In this method we don't use the gradient to change the position of the weights directly. That is, if you think of the weights as a ball on the error surface, standard gradient descent uses the gradient to change the position of that ball: you simply multiply the gradient by a learning rate and change the position of the ball by that vector. In the momentum method, we use the gradient to accelerate the ball; that is, the gradient changes its velocity, and the velocity is what changes the position of the ball. The reason that's different is that the ball can have momentum: it remembers previous gradients in its velocity.

A second method for speeding up mini-batch learning is to use a separate adaptive learning rate for each parameter, and then to slowly adjust that learning rate based on empirical measurements. The obvious empirical measurement is: do we keep making progress by changing the weight in the same direction, or does the gradient keep oscillating around so that its sign keeps changing? If the sign of the gradient keeps changing, we reduce the learning rate, and if it keeps staying the same, we increase it.
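Here is a rough sketch of those two updates (my own code, not from the lecture; the constants, and the additive-increase, multiplicative-decrease recipe for the gains, are common choices rather than values given here):

    import numpy as np

    def momentum_step(weights, velocity, grad, learning_rate=0.01, momentum=0.9):
        # The gradient changes the velocity, and the velocity (which remembers
        # previous gradients) changes the position of the weights.
        velocity = momentum * velocity - learning_rate * grad
        return weights + velocity, velocity

    def update_gains(gains, grad, prev_grad, up=0.05, down=0.95):
        # Per-weight adaptive gains: grow a weight's gain when its gradient
        # keeps the same sign, shrink it when the sign keeps flipping.
        same_sign = grad * prev_grad > 0
        return np.where(same_sign, gains + up, gains * down)

The gain for each weight would then multiply the learning rate used for that weight.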
A third method is what I now call rmsprop. In this method, we divide the gradient for each weight by a running average of the magnitudes of the recent gradients for that weight, so that if the gradients are big you divide by a large number, and if the gradients are small you divide by a small number. That deals very nicely with a wide range of different gradients. It's actually a mini-batch version of just using the sign of the gradient, which is a method called rprop that was designed for full-batch learning.

The final way of speeding up learning, which is what optimization people would naturally recommend, is to take full-batch learning and use a fancy method that takes curvature information into account, adapt that method to work for neural nets, and then maybe adapt it some more so that it works with mini-batches. I am not going to talk about that in this lecture.
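As a rough sketch of the rmsprop update described above (my own code; the decay constant and the small epsilon in the denominator are common choices, not values from the lecture):

    import numpy as np

    def rmsprop_step(weights, mean_square, grad, learning_rate=0.001, decay=0.9):
        # Keep a running average of the squared gradient for each weight and
        # divide the gradient by the square root of that average, so weights
        # with big gradients and weights with small gradients end up making
        # similar-sized steps.
        mean_square = decay * mean_square + (1 - decay) * grad ** 2
        weights = weights - learning_rate * grad / (np.sqrt(mean_square) + 1e-8)
        return weights, mean_square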