Now that we have the preliminaries out of the way, we can get back to the central issue, which is how to learn multiple layers of features. So in this video, I'm finally going to describe the backpropagation algorithm, which was the main advance in the 1980s that led to an explosion of interest in neural networks. Before I describe backpropagation, I'm going to describe another, very obvious algorithm that does not work nearly as well, but is something that many people think of.

Now that we know how to learn the weights of logistic units, we can return to the question of how to learn the weights of hidden units. If you have neural networks without hidden units, they are very limited in the mappings they can model. If you add a layer of hand-coded features, as in a perceptron, you make the net much more powerful, but the difficult bit for a new task is designing the features. The learning won't solve the hard problem; you have to solve it by hand. What we'd like is a way of finding good features without requiring insights into the task or repeated trial and error, where we guess some features and see how well they work. In effect, what we need to do is automate the loop of designing features for a task and seeing how well they work. We'd like the computer to do that loop, instead of having a person in it.

So the thing that occurs to everybody who knows about evolution is to learn by perturbing the weights. You randomly perturb one weight (that's meant to be like a mutation) and you see if it improves performance. If it improves the performance of the net, you save that change in the weight. You can think of this as a form of reinforcement learning: your action consists of making a small change, you check whether that pays off, and if it does, you keep that change.

The problem is that it's very inefficient. Just to decide whether to change one weight, we need to do multiple forward passes on a representative set of training cases; you can't judge whether changing that weight improves things from one training case alone. Relative to this method of randomly changing a weight and seeing if it helps, backpropagation is much more efficient. It's actually more efficient by a factor of the number of weights in the network, which could be millions. An additional problem with randomly changing weights and seeing if it helps is that towards the end of learning, any large change in a weight will nearly always make things worse, because the weights have to have the right relative values to work properly. So towards the end of learning, not only do you have to do a lot of work to decide whether each change helps, but the changes themselves have to be very small. A sketch of this baseline algorithm follows below.

There are slightly better ways of using perturbations in order to learn. One thing we might try is to perturb all the weights in parallel and then correlate the performance gain with the weight changes. That doesn't really help at all. The problem is that we'd need to do lots and lots of trials with different random perturbations of all the weights, in order to see the effect of changing one weight through the noise created by changing all the other weights. So it doesn't help to do it all in parallel.
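To make that baseline concrete, here is a minimal sketch of learning by perturbing one weight at a time, for a single logistic unit. Everything here (the function names, the squared-error measure, the step size) is an illustrative assumption, not anything from the lecture itself:

```python
# A minimal sketch of learning by weight perturbation for one logistic unit.
import numpy as np

rng = np.random.default_rng(0)

def evaluate_error(weights, inputs, targets):
    """Squared error over a representative set of training cases."""
    outputs = 1.0 / (1.0 + np.exp(-(inputs @ weights)))
    return np.sum((targets - outputs) ** 2)

def perturb_one_weight(weights, inputs, targets, step=0.01):
    """Randomly perturb one weight; keep the change only if the error drops."""
    baseline = evaluate_error(weights, inputs, targets)
    i = rng.integers(len(weights))           # pick one weight: a "mutation"
    candidate = weights.copy()
    candidate[i] += rng.normal(scale=step)   # small random change to it
    if evaluate_error(candidate, inputs, targets) < baseline:
        return candidate                     # the change paid off: save it
    return weights                           # otherwise discard it
```

Notice that every call needs two full passes over the training set just to judge a single weight, which is exactly the inefficiency described above.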
Something that does help is to randomly perturb the activities of the hidden units instead of perturbing the weights. Once you've decided that perturbing the activity of a hidden unit on a particular training case is going to make things better, you can then compute how to change the weights. Since there are many fewer activities than weights, there are fewer things to explore at random, and this makes the algorithm more efficient. But it's still much less efficient than backpropagation: backpropagation still wins by a factor of the number of neurons.

So the idea behind backpropagation is that we don't know what the hidden units ought to be doing. They're called hidden units because nobody's telling us what their states ought to be. But we can compute how fast the error changes as we change a hidden activity on a particular training case. So instead of using desired activities for the hidden units, we use the error derivatives with respect to their activities. Since each hidden unit can affect many different output units, it can have many different effects on the overall error, and these effects have to be combined. We can do that efficiently, which allows us to compute the error derivatives for all of the hidden units at the same time. Once we've got those error derivatives for the hidden units (that is, once we know how fast the error changes as we change a hidden activity on that particular training case), it's easy to convert them into error derivatives for the weights coming into a hidden unit.

So here's a sketch of how backpropagation works, for a single training case. First we have to define the error, and here we'll use the squared difference between the target value t_j of each output unit j and the actual value y_j that the net produces; we're going to imagine there are several output units, so the errors get summed over j. We differentiate that, and we get a familiar expression for how the error changes as you change the activity of an output unit j. (There's a small worked sketch of this error and its derivative just below.) I'll use a notation here where the index on a unit tells you which layer it's in: the output layer has a typical index of j, and the layer in front of it, the hidden layer below it in the diagram, has a typical index of i. I won't bother to say which layer we're in, because the index will tell you.

Once we've got the error derivative with respect to the output of one of these output units, we then want to use all those error derivatives in the output layer to compute the same quantity in the hidden layer that comes before it. The core of backpropagation is taking error derivatives in one layer and computing from them the error derivatives in the layer that comes before it. So we want to compute dE/dy_i. Obviously, when we change the output of unit i, it will change the activities of all the output units it connects to (all three of them, in the diagram), so we have to sum up all those effects. We're going to have an algorithm that takes the error derivatives we've already computed for the top layer and combines them, using the same weights as we used in the forward pass, to get error derivatives in the layer below.

The next slide explains the backpropagation algorithm, and you really need to understand it. The first time you see it, you may have to study it for a long time.
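As a quick worked sketch of that error and its derivative (the factor of a half in front of the squared error is an assumption I'm adding so the derivative comes out clean, and the numbers are made up):

```python
import numpy as np

y = np.array([0.2, 0.9, 0.6])    # actual outputs y_j of three output units
t = np.array([0.0, 1.0, 1.0])    # their target values t_j

E = 0.5 * np.sum((t - y) ** 2)   # E = 1/2 * sum_j (t_j - y_j)^2
dE_dy = -(t - y)                 # the familiar expression dE/dy_j = -(t_j - y_j)
```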
This is how you backpropagate the error derivative with respect to the output of a unit. We'll consider an output unit j and a hidden unit i. The output of hidden unit i is y_i, the output of output unit j is y_j, and the total input received by output unit j is z_j.

The first thing we need to do is convert the error derivative with respect to y_j into an error derivative with respect to z_j. To do that we use the chain rule: dE/dz_j = dy_j/dz_j x dE/dy_j. As we've seen before, when we were looking at logistic units, dy_j/dz_j is just y_j(1 - y_j), so dE/dz_j = y_j(1 - y_j) dE/dy_j. So now we've got the error derivative with respect to the total input received by unit j.

Now we can compute the error derivative with respect to the output of unit i. It's the sum, over all the outgoing connections of unit i (all three of them, in the diagram), of the quantity dz_j/dy_i x dE/dz_j. The first term there is how the total input to unit j changes as we change the output of unit i, and we multiply that by how the error changes as we change the total input to unit j, which we computed on the line above. As we saw before when studying the logistic unit, dz_j/dy_i is just the weight w_ij on the connection. So what we get is that the error derivative with respect to the output of unit i is the sum, over all the outgoing connections to the layer above, of the weight w_ij on the connection times a quantity we've already computed for the layer above, dE/dz_j. That is, dE/dy_i = sum over j of w_ij dE/dz_j. You can see the computation looks very like what we do on the forward pass, but we're going in the other direction: for each unit in the hidden layer that contains i, we compute a sum of quantities in the layer above, weighted by the connections.

Once we've got dE/dz_j, which we computed on the first line here, it's very easy to get the error derivatives for all the weights coming into unit j. dE/dw_ij is simply dE/dz_j, which we computed already, times how z_j changes as we change the weight on the connection, and that's simply the activity y_i of the unit in the layer below. So the rule for changing a weight is just: you multiply this quantity computed at a unit, dE/dz_j, by the activity coming in from the layer below, and that gives you the error derivative with respect to the weight.

So on this slide we have seen how we can start with dE/dy_j and backpropagate to get dE/dy_i. We've come backwards through one layer and computed the same quantity, the derivative of the error with respect to the output, in the previous layer. We can clearly do that for as many layers as we like. And after we've done it for all the layers, we can compute how the error changes as you change the weights on the connections. That's the backpropagation algorithm: an algorithm for taking one training case and computing, efficiently, for every weight in the network, how the error will change on that particular training case as you change the weight.
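To pull the whole slide together, here is a minimal sketch of backpropagation on one training case, for a net with one hidden layer of logistic units and logistic output units. The names and shapes (x, t, W1, W2) are assumptions for illustration; each line of the backward pass mirrors one step of the derivation above.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_case(x, t, W1, W2):
    """Return dE/dW1 and dE/dW2 for E = 1/2 * sum_j (t_j - y_j)^2.

    x: inputs, shape (n_in,); t: targets, shape (n_out,)
    W1: weights, shape (n_in, n_hid); W2: weights, shape (n_hid, n_out)
    """
    # Forward pass.
    z_i = W1.T @ x                        # total inputs to the hidden units
    y_i = logistic(z_i)                   # hidden activities
    z_j = W2.T @ y_i                      # total inputs to the output units
    y_j = logistic(z_j)                   # outputs

    # Backward pass.
    dE_dy_j = -(t - y_j)                  # derivative of the squared error
    dE_dz_j = y_j * (1 - y_j) * dE_dy_j   # chain rule through the logistic
    dE_dy_i = W2 @ dE_dz_j                # sum over outgoing connections w_ij
    dE_dz_i = y_i * (1 - y_i) * dE_dy_i   # same conversion, one layer down
    dE_dW2 = np.outer(y_i, dE_dz_j)       # dE/dw_ij = y_i * dE/dz_j
    dE_dW1 = np.outer(x, dE_dz_i)         # same rule for the first layer
    return dE_dW1, dE_dW2
```

Note how the backward step W2 @ dE_dz_j uses the same weights as the forward pass, just in the other direction, which is exactly the point made above.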