Figuring out how to get the error derivatives for all of the weights in a multilayer network is the key to being able to learn neural networks efficiently. But there are a number of other issues that have to be addressed before we get a learning procedure that's fully specified. For example, we need to decide how often to update the weights, and we need to decide how to prevent a large network from over-fitting very badly.

The backpropagation algorithm is an efficient way to compute, for a single training case, the derivatives of the error with respect to each weight. But that's not a learning algorithm by itself; you have to specify a number of other things to get a proper learning procedure. Some of these decisions are about how we're going to optimize, that is, how we're going to use the error derivatives on the individual cases to discover a good set of weights. Those will be described in detail in Lecture 6. Another set of issues is how we ensure that the weights we've learned will generalize well, that is, how we make sure they work on cases we didn't see during training. Lecture 7 will be devoted to that issue. What I'm going to do now is give you a very brief overview of these two sets of issues.

The optimization issues are about how you use the weight derivatives. The first question is how often to update the weights. We could try updating the weights after each training case: you compute the error derivatives on a training case using backpropagation and then make a small change to the weights. Obviously this is going to zigzag around, because each training case gives different error derivatives, but on average, if we make the weight changes small enough, it will go in the right direction.

What seems more sensible is full-batch training, where you do a full sweep through all of the training data, add together all of the error derivatives you get on the individual cases, and then take a small step in that direction. A problem with this is that we start off with a bad set of weights and we might have a very big training set, and we don't want to do all the work of going through the whole training set just to fix up weights that we know are pretty bad. Really, we only need to look at a few training cases before we get a reasonable idea of what direction to move the weights in, and we don't need to look at a large number of training cases until we get towards the end of learning. That gives us mini-batch learning, where we take a small random sample of the training cases and move in the direction given by their summed derivatives. We'll do a little bit of zigzagging, but not nearly as much as with online learning, where we use one training case at a time. Mini-batch learning is what people typically do when they're training big neural networks on big data sets.

Then there's the issue of how much we update the weights, that is, how big a change we make. We could try to pick some fixed learning rate by hand and then learn the weights by changing each weight by the derivative we've computed times that learning rate. It seems more sensible to adapt the learning rate: we can get the computer to reduce the learning rate if we're oscillating around, with the error going up and down, and to increase it if we're making steady progress.
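To make the update schedules concrete, here is a minimal sketch in Python/NumPy of mini-batch gradient descent on a simple model. This is an illustration added to the lecture text, not anything prescribed by it: the linear model, the squared-error loss, and the names minibatch_sgd, batch_size, and learning_rate are my own choices. Setting batch_size to 1 gives online learning, and setting it to the size of the training set gives full-batch learning.

```python
import numpy as np

def minibatch_sgd(X, y, epochs=10, batch_size=32, learning_rate=0.01):
    """Mini-batch gradient descent on a linear model with squared error.

    batch_size=1 corresponds to online learning; batch_size=len(X)
    corresponds to full-batch learning.
    """
    n, d = X.shape
    weights = np.zeros(d)                    # start from some (probably bad) weights
    for epoch in range(epochs):
        order = np.random.permutation(n)     # small random samples of the training cases
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            error = Xb @ weights - yb        # residuals on this mini-batch
            grad = Xb.T @ error / len(idx)   # averaged error derivatives for the batch
            weights -= learning_rate * grad  # take a small step in that direction
    return weights
```

With a batch size of one the updates zigzag from case to case; with the full batch every update has to wait for a complete sweep through the data, which is exactly the trade-off described above.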
We might even have a separate learning rate for each connection in the network, so that some weights learn rapidly and other weights learn more slowly. Or we might go even further and say we don't really want to go in the direction of steepest descent at all. If you look at the figure on the right, when we have a very elongated ellipse, the direction of steepest descent is almost at right angles to the direction towards the minimum that we want to find. This is typical of most learning problems, particularly towards the end of learning. So there are much better directions to go in than the direction of steepest descent; the problem is that it's quite hard to figure out what they are.

The second set of issues is to do with how well the network generalizes to cases it didn't see during training. The problem here is that the training data contains information about the regularities in the mapping from input to output, but it also contains two types of noise. The first type of noise is that the target values may be unreliable; for a neural network, that's usually only a minor worry. The second type of noise is sampling error. If we take any particular training set, especially if it's a small one, there will be accidental regularities that are caused by the particular cases we happened to choose.

So, for example, suppose you show someone some polygons. If you're a bad teacher, you might choose to show them a square and a rectangle. Those are both polygons, but there's no way for someone to realize from that that polygons might have three sides or seven sides, and no way for them to understand that the angles don't have to be right angles. If you're a slightly better teacher, you might show them a triangle and a hexagon, but again, from that they can't tell whether polygons are always convex, and they can't tell whether the angles in polygons are always multiples of 60 degrees. However carefully you choose the examples, for any finite set of examples there will be accidental regularities.

Now, when we fit a model, there's no way it can tell the difference between an accidental regularity that's just there because of the particular samples we chose and a real regularity that will generalize properly to new cases. So what the model will do is fit both kinds of regularity, and if you've got a big, powerful model, it will be very good at fitting the sampling error, and that will be a real disaster: it will cause it to generalize really badly.

This is best understood by looking at a little example. So here we've got six data points, shown in black. We can fit a straight line to them; that model has only two degrees of freedom, and it's fitting the six y values given the six x values. Or we can fit a polynomial that has six degrees of freedom, and by hand I've drawn in red my idea of such a polynomial fitting this data. You'll see the polynomial goes through the data points exactly, so it's a much better fit to the data. But which model do you trust? The complicated model certainly fits the data much better, but it's not economical. For a model to be convincing, what you want is a simple model that explains a lot of data surprisingly well, and the polynomial doesn't do that. It explains these six data points, but it has six degrees of freedom, so wherever those data points had been, it would have been able to fit them.
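As a concrete version of this comparison, here is a small sketch, again in Python/NumPy, fitting a 2-degree-of-freedom line and a 6-degree-of-freedom polynomial to six points. The data values are made up for illustration; the actual points in the lecture are only shown on the slide.

```python
import numpy as np

# Six hypothetical (x, y) points, roughly linear plus a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.3, 1.9, 3.2, 3.8, 5.1])

line = np.polyfit(x, y, deg=1)   # 2 degrees of freedom: slope and intercept
poly = np.polyfit(x, y, deg=5)   # 6 degrees of freedom: goes through every point

def training_error(coeffs):
    """Sum of squared errors of a fitted polynomial on the six training points."""
    return np.sum((np.polyval(coeffs, x) - y) ** 2)

print("training error, line:", training_error(line))   # small but non-zero
print("training error, poly:", training_error(poly))   # essentially zero

# Extrapolating beyond the training range is where the two models disagree badly.
print("prediction at x = 6, line:", np.polyval(line, 6.0))
print("prediction at x = 6, poly:", np.polyval(poly, 6.0))
```

The polynomial's zero training error is exactly the unsurprising fit described above: with six coefficients it could have passed through any six points.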
So we're not surprised that a model that complicated can fit the data very well, and the good fit doesn't convince us that this is a good model. Now, if you look at the arrow, which output value do you predict for this input value? Well, you'd have to have a lot of faith in the polynomial model in order to predict a value that's outside the range of all the values in the training data you've seen so far, and I think almost everybody would prefer to predict the blue circle that's on the green line rather than the one on the red line. However, if we had ten times as much data, and all of those data points lay very close to the red line, then we would certainly prefer the red line.

There are a number of ways to reduce over-fitting that have been developed for neural networks and for many other models, and I'm going to give just a brief survey of them here. There's weight decay, where you try to keep the weights of the network small, or try to keep many of the weights at zero, the idea being that this will make the model simpler. There's weight sharing, where again you make the model simpler by insisting that many of the weights have exactly the same value as each other; you don't know what the value is, and you're going to learn it, but it has to be exactly the same for many of the weights. We'll see in the next lecture how weight sharing is used. There's early stopping, where you make yourself a fake test set, and as you're training the net you peek at what's happening on this fake test set; once the performance on the fake test set starts getting worse, you stop training. There's model averaging, where you train lots of different neural nets and average them together in the hope that this will reduce the errors you're making. There's Bayesian fitting of neural nets, which is just a fancy form of model averaging. There's dropout, where you try to make your model more robust by randomly omitting hidden units when you're training it. And there's generative pre-training, which is somewhat more complicated and will be described towards the end of the course.
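To show what two of these remedies look like in practice, here is a minimal sketch, assuming the same linear model and squared-error setup as in the earlier sketch, that combines weight decay (an L2 penalty on the weights) with early stopping against a held-out "fake test set". The 80/20 split, the weight_cost value, and the patience counter are illustrative choices, not anything fixed by the lecture.

```python
import numpy as np

def train_with_weight_decay_and_early_stopping(
        X, y, weight_cost=0.001, learning_rate=0.01, max_epochs=1000, patience=10):
    """Gradient descent with an L2 weight-decay penalty, stopped early when
    performance on a held-out validation set (the 'fake test set') stops improving."""
    n = len(X)
    split = int(0.8 * n)                     # hold out the last 20% as the fake test set
    X_train, y_train = X[:split], y[:split]
    X_val, y_val = X[split:], y[split:]

    w = np.zeros(X.shape[1])
    best_w, best_val_error, epochs_since_best = w.copy(), np.inf, 0

    for epoch in range(max_epochs):
        error = X_train @ w - y_train
        grad = X_train.T @ error / len(X_train) + weight_cost * w  # weight-decay term
        w -= learning_rate * grad

        val_error = np.mean((X_val @ w - y_val) ** 2)   # peek at the fake test set
        if val_error < best_val_error:
            best_w, best_val_error, epochs_since_best = w.copy(), val_error, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:           # it started getting worse: stop
                break
    return best_w
```

The weight-decay term pulls every weight towards zero on each update, which keeps the model simple, and returning the weights that did best on the held-out set is the early-stopping idea of stopping once the fake test error starts to rise.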