Now that we have the preliminaries out of the way, we can get back to the central issue, which is how to learn multiple layers of features. So in this video, I'm finally going to describe the backpropagation algorithm, which was the main advance in the 1980s that led to an explosion of interest in neural networks. Before I describe backpropagation, I'm going to describe another, very obvious algorithm that does not work nearly as well, but is something that many people think of.

Now that we know how to learn the weights of the logistic units, we're going to return to the central issue, which is how to learn the weights of hidden units. If you have neural networks without hidden units, they are very limited in the mappings they can model. If you add a layer of hand-coded features, as in a perceptron, you make the net much more powerful, but the difficult bit for a new task is designing the features. The learning won't solve the hard problem; you have to solve it by hand. What we'd like is a way of finding good features without requiring insight into the task or repeated trial and error, where we guess some features and see how well they work. In effect, what we need to do is automate the loop of designing features for a task and seeing how well they work. We'd like the computer to do that loop, instead of having a person in that loop.

So the thing that occurs to everybody who knows about evolution is to learn by perturbing the weights. You randomly perturb one weight, so that's meant to be like a mutation, and you see if it improves performance. And if it improves the performance of the net, you save that change in the weight. You can think of this as a form of reinforcement learning: your action consists of making a small change, you then check whether that pays off, and if it does, you decide to perform that action.

The problem is that it's very inefficient. Just to decide whether to change one weight, we need to do multiple forward passes on a representative set of training cases. We have to see if changing that weight improves things, and you can't judge that by one training case alone. Relative to this method of randomly changing a weight and seeing if it helps, backpropagation is much more efficient. It's actually more efficient by a factor of the number of weights in the network, which could be millions.
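To make the weight-perturbation idea concrete, here is a minimal NumPy sketch. Everything about it (the tiny one-hidden-layer logistic net, the function names, the perturbation scale) is an illustrative assumption rather than anything from the lecture; the point is simply that every candidate mutation needs its own evaluation over a representative set of training cases.

```python
import numpy as np

def evaluate(weights, inputs, targets):
    """Squared error of a tiny one-hidden-layer logistic net
    (the architecture is just an assumption for this sketch)."""
    w1, w2 = weights
    hidden = 1.0 / (1.0 + np.exp(-(inputs @ w1)))    # logistic hidden units
    outputs = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # logistic output units
    return np.mean((targets - outputs) ** 2)

def learn_by_perturbing_weights(weights, inputs, targets,
                                scale=0.01, trials=10_000):
    """Mutation-style learning: pick one weight at random, nudge it,
    and keep the change only if the error over the training set drops.
    Each candidate change costs a full forward pass over a representative
    set of cases, which is what makes this approach so inefficient."""
    best = evaluate(weights, inputs, targets)
    for _ in range(trials):
        layer = np.random.randint(len(weights))
        idx = tuple(np.random.randint(dim) for dim in weights[layer].shape)
        old_value = weights[layer][idx]
        weights[layer][idx] = old_value + scale * np.random.randn()
        error = evaluate(weights, inputs, targets)
        if error < best:
            best = error                      # the mutation helped: keep it
        else:
            weights[layer][idx] = old_value   # it didn't: undo it
    return weights, best
```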
An additional problem with randomly changing weights and seeing if it helps is that towards the end of learning, any large change in a weight will nearly always make things worse, because the weights have to have the right relative values to work properly. So towards the end of learning, not only do you have to do a lot of work to decide whether each of these changes helps, but the changes themselves have to be very small.

There are slightly better ways of using perturbations in order to learn. One thing we might try is to perturb all the weights in parallel and then correlate the performance gain with the weight changes. That actually doesn't really help at all. The problem is that we need to do lots and lots of trials with different random perturbations of all the weights in order to see the effect of changing one weight through the noise created by changing all the other weights. So it doesn't help to do it all in parallel.

Something that does help is to randomly perturb the activities of the hidden units, instead of perturbing the weights. Once you've decided that perturbing the activity of a hidden unit on a particular training case is going to make things better, you can then compute how to change the weights. Since there are many fewer activities than weights, there are fewer things that you're randomly exploring, and this makes the algorithm more efficient. But it's still much less efficient than backpropagation. Backpropagation still wins by a factor of the number of neurons.

So the idea behind backpropagation is that we don't know what the hidden units ought to be doing. They're called hidden units because nobody is telling us what their states ought to be. But we can compute how fast the error changes as we change a hidden activity on a particular training case. So instead of using the activities of the hidden units as our desired states, we use the error derivatives with respect to those activities. Since each hidden unit can affect many different output units, it can have many different effects on the overall error if we have many output units. These effects have to be combined, and we can do that efficiently. That allows us to compute error derivatives for all of the hidden units efficiently at the same time.
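As a preview of the derivation worked through below, this combining of effects is a single application of the multivariable chain rule. Writing $y_i$ for the output of hidden unit $i$ and $z_j$ for the total input to output unit $j$ (notation the lecture introduces shortly), the error derivative for a hidden activity is a sum over the output units it feeds into:

$$\frac{\partial E}{\partial y_i} \;=\; \sum_{j} \frac{\partial z_j}{\partial y_i}\,\frac{\partial E}{\partial z_j}$$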
Once we've got those error derivatives for the hidden units, that is, once we know how fast the error changes as we change a hidden activity on that particular training case, it's easy to convert those error derivatives for the activities into error derivatives for the weights coming into a hidden unit.

So here's a sketch of how backpropagation works, for a single training case. First we have to define the error, and here we'll take the error to be the squared difference between the target value of output unit j and the actual value that the net produces for output unit j; we're going to imagine there are several output units in this case. We differentiate that, and we get a familiar expression for how the error changes as you change the activity of an output unit j. I'll use a notation where the index on a unit tells you which layer it's in: the output layer has a typical index of j, and the layer in front of it, the hidden layer below it in the diagram, has a typical index of i. I won't bother to say which layer we're in, because the index will tell you.
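For reference, here is what was just described, written out explicitly (reconstructed from the narration; the factor of one half is the usual convention that makes the derivative come out clean). Here $j$ ranges over output units, $i$ over hidden units in the layer below, $y$ denotes a unit's output, $z_j$ the total input to output unit $j$, and $t_j$ the target for output unit $j$:

$$E = \tfrac{1}{2}\sum_{j}\,(t_j - y_j)^2, \qquad \frac{\partial E}{\partial y_j} = -(t_j - y_j)$$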
So once we've got the error derivative with respect to the output of one of these output units, we then want to use all those error derivatives in the output layer to compute the same quantity in the hidden layer that comes before the output layer. The core of backpropagation is taking the error derivatives in one layer and, from them, computing the error derivatives in the layer that comes before it. So we want to compute dE/dy_i. Now obviously, when we change the output of unit i, it will change the activities of all three of those output units in the diagram, and so we have to sum up all those effects. So we're going to have an algorithm that takes the error derivatives we've already computed for the top layer and combines them, using the same weights as we used in the forward pass, to get the error derivatives in the layer below.

So, this slide is going to explain the backpropagation algorithm, and you really need to understand it. The first time you see it, you may have to study it for a long time. This is how you backpropagate the error derivative with respect to the output of a unit. We'll consider an output unit j and a hidden unit i. The output of hidden unit i is y_i, the output of output unit j is y_j, and the total input received by output unit j is z_j.

The first thing we need to do is convert the error derivative with respect to y_j into an error derivative with respect to z_j. To do that we use the chain rule: dE/dz_j equals dy_j/dz_j times dE/dy_j. As we've seen before, when we were looking at logistic units, dy_j/dz_j is just y_j(1 - y_j), so dE/dz_j is y_j(1 - y_j) times the error derivative with respect to the output of unit j. So now we've got the error derivative with respect to the total input received by unit j.

Now we can compute the error derivative with respect to the output of unit i. It's going to be the sum, over all of the outgoing connections of unit i, of the quantity dz_j/dy_i times dE/dz_j. The first term there is how the total input to unit j changes as we change the output of unit i, and we multiply that by how the error changes as we change the total input to unit j, which we computed on the line above. And as we saw before when studying the logistic unit, dz_j/dy_i is just the weight on the connection, w_ij. So what we get is that the error derivative with respect to the output of unit i is the sum, over all the outgoing connections to the layer above, of the weight w_ij on that connection times a quantity we have already computed, dE/dz_j, for the layer above. And so you can see the computation looks very like what we do on the forward pass, but we're going in the other direction: for each unit in the hidden layer that contains i, we compute a sum of quantities in the layer above, weighted by the weights on the connections.

Once we've got dE/dz_j, which we computed on the first line here, it's very easy to get the error derivatives for all the weights coming into unit j. dE/dw_ij is simply dE/dz_j, which we computed already, times how z_j changes as we change the weight on the connection, and that's simply the activity of the unit in the layer below, y_i. So the rule for changing the weight is just: you multiply this quantity you've computed at a unit, dE/dz_j, by the activity coming in from the layer below, and that gives you the error derivative with respect to the weight.
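Here is a minimal NumPy sketch of those three equations, for a layer of logistic units sitting above a layer of units with activities y_i. The function name, argument names, and array shapes are choices made for illustration, not anything from the lecture:

```python
import numpy as np

def backprop_layer(y_below, y_out, dE_dy_out, W):
    """One backward step through a logistic layer.
    Assumed shapes (an illustrative convention, not the lecture's):
      y_below   : (n_below,)        activities y_i of the layer below
      y_out     : (n_out,)          activities y_j of this layer
      dE_dy_out : (n_out,)          dE/dy_j, already known for this layer
      W         : (n_below, n_out)  weights w_ij coming into this layer
    """
    # dE/dz_j = y_j (1 - y_j) * dE/dy_j     (logistic slope times chain rule)
    dE_dz = y_out * (1.0 - y_out) * dE_dy_out
    # dE/dy_i = sum_j w_ij * dE/dz_j        (combine effects through the outgoing weights)
    dE_dy_below = W @ dE_dz
    # dE/dw_ij = y_i * dE/dz_j              (outer product gives every weight at once)
    dE_dW = np.outer(y_below, dE_dz)
    return dE_dy_below, dE_dW
```

Note how the middle line mirrors the forward pass: the same weights are used, but the quantities flow through them in the opposite direction, which is exactly the point made above.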
So on this slide we have seen how we can start with dE/dy_j and backpropagate to get dE/dy_i; we've come backwards through one layer and computed the same quantity, the derivative of the error with respect to the output, in the previous layer. We can clearly do that for as many layers as we like. And after we've done that for all these layers, we can compute how the error changes as you change the weights on the connections. That's the backpropagation algorithm: it's an algorithm for taking one training case and computing, efficiently, for every weight in the network, how the error will change, on that particular training case, as you change that weight.
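To make the "as many layers as we like" point concrete, here is a hedged sketch of applying the per-layer step from the previous code block repeatedly for a single training case. It assumes every layer is logistic and reuses the hypothetical backprop_layer function defined above; the names and conventions are again my own:

```python
def backprop(activities, weights, targets):
    """Backward pass for one training case through a stack of logistic layers.
    activities[0] is the input vector and activities[-1] the output vector from
    the forward pass; weights[k] connects activities[k] to activities[k + 1].
    Returns dE/dW for every weight matrix, reusing backprop_layer from above."""
    dE_dy = -(targets - activities[-1])   # derivative of 1/2 * sum (t - y)^2 w.r.t. the outputs
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        dE_dy, grads[k] = backprop_layer(activities[k], activities[k + 1],
                                         dE_dy, weights[k])
    return grads
```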