In this video, we are going to look at a number of issues that arise when using stochastic gradient descent with mini-batches. There are a large number of tricks that make things work much better; these are the kind of black art of neural networks, and I'm going to go over some of the main tricks in this video. The first issue I want to talk about is initializing the weights in your neural network. If two hidden units have exactly the same weights and the same biases, with the same incoming and outgoing connections, then they can never become different from one another, because they will always get exactly the same gradient. So, to allow them to learn different feature detectors, you need to start them off different from one another. We do this by initializing the weights to small random values; that breaks the symmetry. Those small random weights shouldn't all necessarily be the same size as each other. If you've got a hidden unit with a very big fan-in, using quite big weights will tend to saturate it, so you can afford to use much smaller weights; if a hidden unit has a very small fan-in, you want to use bigger weights. Since the weights are random, the size of the total input to a unit scales with the square root of the number of incoming weights, and so a good principle is to make the size of the initial weights inversely proportional to the square root of the fan-in. We can also scale the learning rates for the weights the same way.
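Here is a minimal sketch of that initialization rule in NumPy; the particular layer sizes and the Gaussian draw are just my illustrative choices, not something specified in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    # Small random weights break the symmetry between hidden units.
    # Dividing by sqrt(fan_in) keeps the typical total input to a unit
    # about the same size whether it has ten incoming weights or a thousand.
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# A unit with a big fan-in gets proportionally smaller initial weights.
W_small_fan_in = init_weights(10, 50)     # std of each weight is about 0.32
W_big_fan_in   = init_weights(1000, 50)   # std of each weight is about 0.03
```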
One thing that has a surprisingly big effect on the speed with which a neural network will learn is shifting the inputs; that is, adding a constant to each component of the inputs. It seems surprising that that could make much difference, but when you're using steepest descent, shifting an input value by adding a constant can make a very big difference. It usually helps to shift each component of the input so that, averaged over all of the training data, it has a value of zero; that is, make sure its mean is zero. So suppose we have a little neuron-like unit, just a linear neuron with two weights, and suppose we have two training cases. The first training case says that when the inputs are 101 and 101, you should give an output of two. The second one says that when the inputs are 101 and 99, you should output zero. I'm using color here to indicate which training case I'm talking about. If you look at the error surface you get for those two training cases, it looks like this: the green line is the line along which the weights will satisfy the first training case, and the red line is the line along which the weights will satisfy the second training case. What we notice is that they're almost parallel, and so when you combine them you get a very elongated ellipse. One way to think about what's going on here is that, because we're using a squared error measure, we get a parabolic trough along the red line: the red line is the bottom of the parabolic trough that tells us the squared error we'll get on the red case. And there's another parabolic trough with the green line along its bottom. It turns out, although this may surprise your spatial intuition, that if you add together two parabolic troughs you get a quadratic bowl; a very elongated quadratic bowl, in this case. So that's where that error surface came from. Now look what happens if we subtract a hundred from each of those two input components. We get a completely different error surface: in this case it's a circle, which is ideal. The green line is now the line along which the weights add to two: we're going to take the first weight and multiply it by one, take the second weight and multiply it by one, and we need to get two, so the weights had better add to two. The red line is the line along which the two weights are equal, because we're going to take the first weight and multiply it by one, and the second weight and multiply it by minus one, so if the weights are equal we'll get the zero that we need. So the error surface in this case is a nice circle where gradient descent is really easy, and all we did was subtract 100 from every input. If you're thinking about what happens not with the inputs but with the hidden units, it makes sense to have hidden units that are hyperbolic tangents, which go between minus one and one. The hyperbolic tangent of x is simply twice the logistic of 2x, minus one. The reason that makes sense is that the activities of the hidden units are then roughly zero mean, and that should make the learning faster in the next layer. Of course, that's only true if the inputs to the hyperbolic tangents are distributed sensibly around zero. In that respect, a hyperbolic tangent is better than a logistic. However, there are other respects in which a logistic is better. For example, a logistic gives you a rug to sweep things under: for a big negative input it gives an output of essentially zero, and if you make the input even more negative, the output is still essentially zero, so fluctuations in big negative inputs are ignored by the logistic. For a hyperbolic tangent you have to go out to the end of its plateaus before it can ignore anything. Another thing that makes a big difference is scaling the inputs. When we use steepest descent, scaling the input values is a very simple thing to do: we transform them so that each component of the input has unit variance over the whole training set, so that it has a typical value of one or minus one. Again, if we take this simple net with two weights and look at the error surface when the first component of the input is very small and the second component is much bigger, we get an ellipse with very high curvature along the axis where the input component is big, because small changes in that weight make a big difference to the output, and very low curvature along the axis where the input component is small, because small changes to that weight hardly make any difference to the error. The color here indicates which axis we're using, not which training example, as it did on the previous slide. If we simply change the variance of the inputs, just rescale them, making the first component ten times as big and the second component ten times as small, we get a nice circular error surface. Shifting and scaling the inputs are very simple things to do. Something that's a bit more complicated actually works even better, because it's guaranteed to give you a circular error surface, at least for a linear neuron: we try to decorrelate the components of the input vectors. In other words, if you take two components and look at how they're correlated with one another over the whole training set (remember the earlier example, where the number of portions of chips and the number of portions of ketchup might be highly correlated), we want to try to get rid of those correlations, and that will make learning much easier. There are actually many ways to decorrelate things. For those of you who know about principal components analysis, a very sensible thing to do is to apply it: remove the components that have the smallest eigenvalues, which already achieves some dimensionality reduction, and then scale the remaining components by dividing them by the square roots of their eigenvalues. For a linear system, that will give you a circular error surface. If you don't know about principal components analysis, we'll cover it later in the course. Once you've got a circular error surface, the gradient points straight towards the minimum, so learning is really easy.
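Here is a minimal sketch of these three preprocessing steps (shift, scale, decorrelate with PCA) applied to a matrix of training inputs; the function name, the n_keep parameter, and the data layout (one training case per row) are my own illustrative assumptions.

```python
import numpy as np

def preprocess(X, n_keep):
    """X: array of shape (num_cases, num_components) of training inputs."""
    # Shift: give every input component zero mean over the training set.
    X = X - X.mean(axis=0)
    # Scale: give every input component unit variance over the training set
    # (assumes no component is constant).
    X = X / X.std(axis=0)
    # Decorrelate with PCA: project onto the eigenvectors of the covariance,
    # drop the directions with the smallest eigenvalues, and divide each
    # remaining component by the square root of its eigenvalue.
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, -n_keep:]               # keep the n_keep largest directions
    return (X @ top) / np.sqrt(eigvals[-n_keep:])
```

With inputs preprocessed like this, the error surface for a single linear neuron with squared error is circular, so plain gradient descent heads straight for the minimum.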
Now let's talk about a few of the common problems that people encounter. One thing that can happen, if you start with a learning rate that's much too big, is that you drive the hidden units to be either firmly on or firmly off; that is, their incoming weights become very big and positive or very big and negative, and their state no longer depends on the input. That means the error derivatives coming back from the output won't affect them, because they are out on the plateaus where the derivative is essentially zero, and so learning will stop. Because people are expecting to see local minima, when learning stops they say, "Oh, I'm at a local minimum and the error's terrible, so there are these really bad local minima." Usually that's not true; usually it's because you got stuck out on the end of a plateau. A second problem occurs when you are classifying things and you're using either a squared error or a cross-entropy error. The best guessing strategy is normally to make the output unit equal to the proportion of the time that it should be one. The network will fairly quickly find that strategy, and so the error will fall quickly, but, particularly if the network has many layers, it may take a long time before it improves much on that, because to improve on the guessing strategy it has to get sensible information from the input through all the hidden layers to the output, and that can take a long time to learn if you start with small weights. So again, the error falls quickly and then stops decreasing, and it looks like a local minimum, but actually it's another plateau. I mentioned earlier that towards the end of learning you should turn down the learning rate. You should also be careful about turning down the learning rate too soon. When you turn down the learning rate, you reduce the random fluctuations in the error due to the different gradients on different mini-batches, but of course you also reduce the rate of learning. So if you look at the red curve, you see that when we turn the learning rate down we get a quick win: the error falls, but after that we get slower learning. If we do that too soon, we're going to lose relative to the green curve. So don't turn down the learning rate too soon, and don't turn it down too much. I'm now going to talk about four main ways to speed up mini-batch learning a lot. The previous things I talked about were a kind of bag of tricks for making things work better; these are four methods all explicitly designed to make the learning go much faster. The first is the momentum method. In this method we don't use the gradient to change the position of the weights. That is, if you think of the weights as a ball on the error surface, standard gradient descent uses the gradient to change the position of that ball: you simply multiply the gradient by a learning rate and change the position of the ball by that vector. In the momentum method, we use the gradient to accelerate this ball; that is, the gradient changes its velocity.
And then the velocity is what changes the position of the ball. The reason that's different is that the ball can have momentum: it remembers previous gradients in its velocity. A second method for speeding up mini-batch learning is to use a separate adaptive learning rate for each parameter, and to slowly adjust that learning rate based on empirical measurements. The obvious empirical measurement is: do we keep making progress by changing the weight in the same direction, or does the gradient keep oscillating around so that its sign keeps changing? If the sign of the gradient keeps changing, we reduce the learning rate, and if it keeps staying the same, we increase the learning rate. A third method is what I now call rmsprop. In this method we divide the learning rate for each weight by a running average of the magnitudes of the recent gradients for that weight. So if the gradients are big, you divide by a large number, and if the gradients are small, you divide by a small number. That deals very nicely with a wide range of different gradient magnitudes. It's really a mini-batch version of just using the sign of the gradient, which is a method called rprop that was designed for full-batch learning. The final way of speeding up learning, which is what optimization people would naturally recommend, is to use full-batch learning with a fancy method that takes curvature information into account, to adapt that method to work for neural nets, and then maybe to adapt it some more so that it works with mini-batches. I'm not going to talk about that in this lecture.
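Here is a rough sketch of what the momentum and rmsprop updates look like for one weight vector; the function names, the decay constant, and the small epsilon in the denominator are my own illustrative choices rather than values given in the lecture.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # The gradient accelerates the "ball": it changes the velocity,
    # and the velocity is what changes the position of the weights.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def rmsprop_step(w, grad, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    # Keep a running average of the squared gradient for each weight, then
    # divide the gradient by the square root of that average, so big and
    # small gradients end up producing similar-sized weight changes.
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(mean_sq) + eps), mean_sq

# Usage: carry the velocity / mean-square state from one mini-batch to the next.
w = np.zeros(3)
velocity = np.zeros_like(w)
mean_sq = np.zeros_like(w)
grad = np.array([0.5, -0.1, 2.0])   # gradient from one mini-batch (made up)
w, velocity = momentum_step(w, grad, velocity)
w, mean_sq = rmsprop_step(w, grad, mean_sq)
```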