Figuring out how to get the error derivatives for all of the weights in a multilayer network is the key to being able to learn multilayer networks efficiently. But there are a number of other issues that have to be addressed before we actually get a learning procedure that's fully specified. For example, we need to decide how often to update the weights, and we need to decide how to prevent the network from overfitting very badly if we use a large network.

The backpropagation algorithm is an efficient way to compute, for a single training case, the derivatives of the error with respect to each weight. But that's not a learning algorithm; you have to specify a number of other things to get a proper learning procedure. Some of these decisions are about how we're going to optimize, that is, how we're going to use the error derivatives on the individual cases to discover a good set of weights. Those will be described in detail in Lecture six. Another set of issues is how we ensure that the weights we've learned will generalize well, that is, how we make sure they work on cases we didn't see during training; Lecture seven will be devoted to that issue. What I'm going to do now is give you a very brief overview of these two sets of issues.

Optimization issues are about how you use the weight derivatives. The first question is how often you should update the weights. We could try updating the weights after each training case: you compute the error derivatives on a training case using backpropagation, and then you make a small change to the weights. Obviously, this is going to zigzag around, because on each training case you'll get different error derivatives. But on average, if we make the weight changes small enough, it'll go in the right direction.

What seems more sensible is to use full-batch training, where you do a full sweep through all of the training data, add together all of the error derivatives you get on the individual cases, and then take a small step in that direction. A problem with this is that we start off with a bad set of weights and we might have a very big training set, and we don't want to do all the work of going through the whole training set just to fix up some weights that we know are pretty bad.
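Before going further, here is a minimal sketch contrasting these two update schedules, using a toy linear model with squared error; the model, the helper names, and the learning rate are illustrative assumptions, not anything specified in the lecture:

```python
import numpy as np

def grad_single_case(w, x, y):
    """Error derivative for one training case of a toy linear model
    with squared error: E = 0.5 * (w.x - y)^2."""
    return (w @ x - y) * x

def online_update(w, X, Y, lr=0.01):
    """Online learning: change the weights after every single training case."""
    for x, y in zip(X, Y):
        w = w - lr * grad_single_case(w, x, y)
    return w

def full_batch_update(w, X, Y, lr=0.01):
    """Full-batch learning: add together the derivatives from all cases,
    then take one small step in that summed direction."""
    g = sum(grad_single_case(w, x, y) for x, y in zip(X, Y))
    return w - lr * g
```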
In practice, we only need to look at a few training cases before we get a reasonable idea of what direction we want to move the weights in, and we don't need to look at a large number of training cases until we get towards the end of learning. That gives us mini-batch learning, where we take a small random sample of the training cases and go in that direction. We'll do a little bit of zigzagging, but not nearly as much as if we did online learning, where we use one training case at a time. Mini-batch learning is what people typically do when they're training big neural networks on big data sets.

Then there's the issue of how much we update the weights, that is, how big a change we make. We could just pick some fixed learning rate by hand and then learn the weights by changing each weight by the derivative we've computed times that learning rate. It seems more sensible to adapt the learning rate: we could get the computer to adapt it so that if we're oscillating around, with the error going up and down, we reduce the learning rate, but if we're making steady progress, we increase it. We might even have a separate learning rate for each connection in the network, so that some weights learn rapidly and other weights learn more slowly. Or we might go even further and say we don't really want to go in the direction of steepest descent at all. If you look at the figure on the right, when we have a very elongated ellipse, the direction of steepest descent is almost at right angles to the direction to the minimum that we want to find. This is typical of most learning problems, particularly towards the end of learning. So there are much better directions to go in than the direction of steepest descent; the problem is that it's quite hard to figure out what they are.
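To make mini-batch learning and a simple adaptive learning rate concrete, here is a minimal sketch; the `loss_and_grad` callable, the batch size, and the adjustment factors are assumptions made up for the example, and real implementations usually adapt the rate using a smoother estimate of the error than a single mini-batch:

```python
import numpy as np

def minibatch_train(w, X, Y, loss_and_grad, lr=0.01, batch_size=32, steps=1000):
    """Mini-batch learning with a crude adaptive learning rate:
    shrink the rate when the error goes up, grow it slightly when it goes down.
    X and Y are assumed to be NumPy arrays with one training case per row."""
    rng = np.random.default_rng(0)
    prev_loss = np.inf
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # small random sample
        loss, g = loss_and_grad(w, X[idx], Y[idx])
        w = w - lr * g                # small step in the mini-batch direction
        if loss > prev_loss:
            lr *= 0.5                 # oscillating: reduce the learning rate
        else:
            lr *= 1.05                # steady progress: increase it a little
        prev_loss = loss
    return w
```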
The second set of issues is to do with how well the network generalizes to cases it didn't see during training. The problem here is that the training data contains information about the regularities in the mapping from input to output, but it also contains two types of noise. The first type of noise is that the target values may be unreliable, and for a neural network that's usually only a minor worry. The second type of noise is sampling error: if we take any particular training set, especially if it's a small one, there will be accidental regularities that are caused by the particular cases that we chose.

So, for example, suppose you show someone some polygons. If you're a bad teacher, you might choose to show them a square and a rectangle. Those are both polygons, but there's no way for someone to realize from that that polygons might have three sides or seven sides, and no way for them to understand that the angles don't have to be right angles. If you're a slightly better teacher, you might show them a triangle and a hexagon. But again, from that they can't tell whether polygons are always convex, and they can't tell whether the angles in polygons are always multiples of 60 degrees. However carefully you choose examples, for any finite set of examples there will be accidental regularities.

Now, when we fit a model, there's no way it can tell the difference between an accidental regularity, which is only there because of the particular samples we chose, and a real regularity that will generalize properly to new cases. So what the model will do is fit both kinds of regularity. And if you've got a big, powerful model, it'll be very good at fitting the sampling error, and that will be a real disaster: it will cause it to generalize really badly.

This is best understood by looking at a little example. Here we've got six data points shown in black. We can fit a straight line to them; that model has two degrees of freedom, and it's fitting the six y values given the six x values. Or we can fit a polynomial that has six degrees of freedom. By hand, I've drawn in red my idea of a polynomial with six degrees of freedom fitting this data. You'll see the polynomial goes through the data points exactly, so it's a much better fit to the data. But which model do you trust? The complicated model certainly fits the data much better, but it's not economical.

For a model to be convincing, what you want is a simple model that explains a lot of data surprisingly well, and the polynomial doesn't do that. It explains these six data points, but it's got six degrees of freedom, so whatever these six data points had been, it would have been able to fit them. We're not surprised that a model this complicated can fit the data very well, and so it doesn't convince us that this is a good model.
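As a concrete version of this example, here is a small sketch using made-up data; with six points, a degree-five polynomial (six degrees of freedom) passes through them exactly, while a straight line (two degrees of freedom) does not:

```python
import numpy as np

# Six data points, invented for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 1.1, 1.9, 3.2, 3.8, 5.1])

line  = np.polyfit(x, y, deg=1)   # 2 degrees of freedom: slope and intercept
poly5 = np.polyfit(x, y, deg=5)   # 6 degrees of freedom: fits all six points exactly

x_new = 6.5                       # an input outside the training range
print(np.polyval(line,  x_new))   # modest extrapolation from the straight line
print(np.polyval(poly5, x_new))   # the degree-5 polynomial can swing wildly out here
```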
So, if you look at the arrow, which output value do you predict for this input value? Well, you'd have to have a lot of faith in the polynomial model to predict a value that's outside the range of values in all of the training data you've seen so far. I think almost everybody would prefer to predict the blue circle that's on the green line rather than the one on the red line. However, if we had ten times as much data, and all of those data points lay very close to the red line, then we would certainly prefer the red line.

There are a number of ways to reduce overfitting that have been developed for neural networks and for many other models, and I'm going to give just a brief survey of them here. There's weight decay, where you try and keep the weights of the network small, or try and keep many of the weights at zero; the idea is that this will make the model simpler. There's weight sharing, where again you make the model simpler by insisting that many of the weights have exactly the same value as each other. You don't know what that value is and you're going to learn it, but it has to be exactly the same for many of the weights. We'll see in the next lecture how weight sharing is used. There's early stopping, where you make yourself a fake test set, and as you're training the net you peek at what's happening on this fake test set; once the performance on the fake test set starts getting worse, you stop training. There's model averaging, where you train lots of different neural nets and average them together in the hope that this will reduce the errors you're making. There's Bayesian fitting of neural nets, which is just a fancy form of model averaging. There's dropout, where you try and make your model more robust by randomly omitting hidden units when you're training it. And there's generative pre-training, which is somewhat more complicated and which I'll describe towards the end of the course.
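To illustrate two of these, here is a minimal sketch combining L2 weight decay with early stopping on a held-out "fake test" set; the `grad` and `error` callables and all the hyper-parameter values are assumptions made up for the example, not anything prescribed in the lecture:

```python
import numpy as np

def train(w, train_data, fake_test_data, grad, error,
          lr=0.01, decay=1e-4, patience=5, max_steps=10_000):
    """Gradient descent with L2 weight decay and early stopping on a held-out
    'fake test' set. w is assumed to be a NumPy array of weights."""
    best_w, best_err, bad_steps = w.copy(), np.inf, 0
    for _ in range(max_steps):
        g = grad(w, train_data) + decay * w   # weight decay pulls weights toward zero
        w = w - lr * g
        err = error(w, fake_test_data)        # peek at the held-out set
        if err < best_err:
            best_w, best_err, bad_steps = w.copy(), err, 0
        else:
            bad_steps += 1
            if bad_steps >= patience:         # it keeps getting worse: stop training
                break
    return best_w                             # the weights that did best on the fake test set
```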