In this video we're going to look at the error surface for a linear neuron. By understanding the shape of this error surface, we can understand a lot about what happens as a linear neuron is learning. We can get a nice geometrical understanding of what's happening when we learn the weights of a linear neuron by considering a space that's very like the weight space we used to understand perceptrons, but with one extra dimension. So we imagine a space in which all the horizontal dimensions correspond to the weights, and there's one vertical dimension that corresponds to the error. In this space, points on the horizontal plane correspond to different settings of the weights, and the height corresponds to the error you're making with that set of weights, summed over all training cases.

For a linear neuron, the errors you make for each set of weights define an error surface, and this error surface is a quadratic bowl. That is, if you take a vertical cross-section, it's always a parabola, and if you take a horizontal cross-section, it's always an ellipse. This is only true for linear systems with a squared error. As soon as we go to multilayer nonlinear neural nets, this error surface gets more complicated. As long as the weights aren't too big, the error surface will still be smooth, but it may have many local minima.

Using this error surface, we can get a picture of what's happening as we do gradient descent learning using the delta rule. What the delta rule does is compute the derivative of the error with respect to the weights. If you change the weights in proportion to that derivative, that's equivalent to doing steepest descent on the error surface. To put it another way, if we look at the error surface from above, we get elliptical contour lines, and the delta rule is going to take us at right angles to those elliptical contour lines, as shown in the picture. That's what happens with what's called batch learning, where we get the gradient summed over all training cases. But we could also do online learning, where after each training case we change the weights in proportion to the gradient for that single training case. That's much more like what we do in perceptrons. And, as you can see, the change in the weights moves us towards one of these constraint planes.
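To make the batch versus online distinction concrete, here is a minimal Python sketch of the delta rule for a linear neuron with squared error. The data, learning rates, and function names (delta_rule_batch, delta_rule_online) are made up for illustration and are not from the lecture.

```python
import numpy as np

# Two illustrative training cases for a linear neuron with two weights.
X = np.array([[2.0, 1.0],
              [1.0, 3.0]])
t = np.array([5.0, 7.0])          # target outputs

def delta_rule_batch(X, t, lr=0.05, steps=200):
    """Batch learning: each step uses the gradient summed over all training cases."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        y = X @ w                  # outputs of the linear neuron
        grad = -(t - y) @ X        # dE/dw for E = 1/2 * sum (t - y)^2
        w -= lr * grad             # move downhill, at right angles to the contours
    return w

def delta_rule_online(X, t, lr=0.05, sweeps=200):
    """Online learning: update after each case, stepping towards that case's constraint line."""
    w = np.zeros(X.shape[1])
    for _ in range(sweeps):
        for x_i, t_i in zip(X, t):
            w += lr * (t_i - x_i @ w) * x_i
    return w

print(delta_rule_batch(X, t))      # both end up near the weights that fit both cases
print(delta_rule_online(X, t))
```

Because the quadratic bowl has a single minimum, both versions end up at essentially the same weights here; they just take different paths to get there.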
In the picture on the right, there are two training cases. To get the first training case correct, the weights must lie on one of those blue lines, and to get the second training case correct, the two weights must lie on the other blue line. So if we start at one of those red points and compute the gradient on the first training case, the delta rule will move us perpendicularly towards that line. If we then consider the other training case, we'll move perpendicularly towards the other line. And if we alternate between the two training cases, we'll zigzag backwards and forwards, moving towards the solution point, which is where those two lines intersect. That's the set of weights that is correct for both training cases.

Using this picture of the error surface, we can also understand the conditions that will make learning very slow. If that ellipse is very elongated, which is going to happen if the lines that correspond to the training cases are almost parallel, then the gradient has a nasty property. If you look at the red arrow in the picture, the gradient is big in the direction in which we don't want to move very far, and it's small in the direction in which we want to move a long way. So the gradient will quickly take us across the bottom of the ravine, corresponding to the narrow axis of the ellipse, and it will take a long time to take us along the ravine, corresponding to the long axis of the ellipse. It's just the opposite of what we want. We'd like the gradient to be small across the ravine and big along the ravine, but that's not what we get. And so simple steepest descent, in which you change each weight in proportion to a learning rate times the error derivative, is going to have great difficulty with very elongated error surfaces like the one shown in the picture.
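As a rough illustration of that last point, here is a small Python sketch with made-up data: two training cases whose input vectors are almost parallel give an error surface whose curvatures along the two axes of the ellipse differ by a factor of several thousand, and steepest descent, whose learning rate is limited by the steep direction, then crawls along the ravine.

```python
import numpy as np

# Illustrative data (not from the lecture): almost-parallel training cases.
X = np.array([[1.0, 1.00],
              [1.0, 1.05]])
t = np.array([2.0, 2.1])           # the unique solution is w = (0, 2)

H = X.T @ X                        # curvature matrix of the quadratic bowl
lam = np.linalg.eigvalsh(H)        # eigenvalues in ascending order
print("curvature across / along the ravine:", lam[1], lam[0])
print("elongation (condition number):", lam[1] / lam[0])   # in the thousands

lr = 0.4                           # close to the largest rate that is stable in the steep direction
w = np.zeros(2)
for _ in range(2000):
    grad = -(t - X @ w) @ X        # steepest-descent gradient, summed over both cases
    w -= lr * grad
print("after 2,000 steps:", w)     # still well short of (0, 2): slow progress along the ravine
```

The step size has to stay small enough not to overshoot across the narrow axis, so movement along the long axis, where we actually need to travel a long way, is tiny on every step.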