In this video, I'm going to talk about the reason why we want to combine many models when we're making predictions. If we have a single model, we have to choose some capacity for it. If we choose too little capacity, it won't be able to fit the regularities in the training data. And if we choose too much capacity, it will be able to fit the sampling error in the particular training set we have. By using many models, we can actually get a better tradeoff between fitting the true regularities and overfitting the sampling error in the data. At the start of the video, I'll show you that when you average models together, you can expect to do better than any single model. This effect is largest when the models make very different predictions from each other. And at the end of the video, I'll discuss various ways in which we can encourage the different models to make very different predictions.

As we've seen before, when we have a limited amount of training data, we tend to get overfitting. If we average the predictions of many different models, we can typically reduce that overfitting. This helps most when the models make very different predictions from one another.

For regression, the squared error can be decomposed into a bias term and a variance term, and that allows us to analyze what's going on. The bias term is big if the model has too little capacity to fit the data; it measures how poorly the model approximates the true function. The variance term is big if the model has so much capacity that it's good at modeling the sampling error in our particular training set. It's called variance because, if we went and got another training set of the same size from the same distribution, our model would fit differently to that training set, because it has different sampling error. And so we get variance in the way the models fit to different training sets.

If we average models together, what we're doing is averaging away the variance, and that allows us to use individual models that have high capacity and therefore high variance. These high-capacity models typically have low bias, so we can get the low bias without incurring the high variance, by using averaging to get rid of the variance. So now let's try to analyze how an individual model compares with an average of models.
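For reference, the bias-variance decomposition described above can be written out explicitly. This is the standard form; the notation is added here rather than taken from the lecture, and it assumes the target t for a given input is noise-free (a noisy target would add a third, irreducible term):

\mathbb{E}_D\!\left[(t - y_D)^2\right]
  \;=\; \underbrace{\left(t - \mathbb{E}_D[y_D]\right)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}_D\!\left[\left(y_D - \mathbb{E}_D[y_D]\right)^2\right]}_{\text{variance}}

where D ranges over training sets of the same size drawn from the same distribution, and y_D is the prediction of the model fitted to D.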
On any one test case, some individual predictors may be better than the combined predictor, and different individual predictors will be better on different cases. But if the individual predictors disagree a lot, the combined predictor is typically better than all of the individual predictors when we average over test cases. So we should aim to make the individual predictors disagree, without making them be poor predictors. The art is to have individual predictors that make very different errors from one another, but are each fairly accurate.

So now let's look at the math of what happens when we combine networks. We're going to compare two expected squared errors. The first expected squared error is the one we get if we pick one of the predictors at random and use that for making our predictions; that is, we average, over all predictors, the error we'd expect to get if we followed that policy. So y-bar is the average of what all the predictors say, and y_i is what an individual predictor says. Y-bar is just the expectation, over all the individual predictors i, of y_i, and I'm using angle brackets to represent an expectation, where the subscript that comes after the angle brackets tells you what it's an expectation over. We can write the same thing as one over N times the sum, over all N predictors, of the y_i.

Now, if we look at the expected squared error we'd get if we chose a predictor at random, what we have to do is compare that predictor with the target, take the squared difference, and then average that over all predictors. That's what's on the left-hand side. If I simply add a y-bar and subtract a y-bar, I don't change the value, and now it's going to be easier to do some manipulations. I can now multiply out that square, and inside the expectation brackets I have (t minus y-bar) squared, (y_i minus y-bar) squared, and a cross term, (t minus y-bar) times (y_i minus y-bar), which is going to disappear.

The first term, (t minus y-bar) squared, doesn't have an i in it any more, so we can forget about the expectation brackets for that term: it really is just (t minus y-bar) squared. And that's the squared error you'd get if you compared the average of the models with the target. Our aim is to show that the thing on the left-hand side is bigger than that, i.e., that by using the average we've reduced the expected squared error.
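Written out in the notation just introduced, with the angle brackets denoting an average over the N individual predictors, the manipulation described above is

\bar{y} \;=\; \langle y_i \rangle_i \;=\; \frac{1}{N}\sum_{i=1}^{N} y_i

\left\langle (t - y_i)^2 \right\rangle_i
  \;=\; \left\langle \big((t - \bar{y}) - (y_i - \bar{y})\big)^2 \right\rangle_i
  \;=\; (t - \bar{y})^2 \;+\; \left\langle (y_i - \bar{y})^2 \right\rangle_i
        \;-\; 2\,(t - \bar{y})\left\langle y_i - \bar{y} \right\rangle_i

A short numerical check of the resulting identity, once the cross term has gone (the numbers here are arbitrary; both lines print the same value):

import numpy as np
y = np.array([1.2, 0.7, 1.9, 1.4])    # arbitrary individual predictions
t = 1.0                               # arbitrary target
print(np.mean((t - y) ** 2))                  # expected squared error of a randomly picked predictor
print((t - y.mean()) ** 2 + y.var())          # squared error of the average, plus the variance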
So the extra term we have on the right-hand side is the expectation of (y_i minus y-bar) squared, and that's just the variance of the y_i: the expected squared difference between y_i and y-bar. And the last term disappears. It disappears because we expect the difference of y_i from y-bar to be uncorrelated with the error that the average of the networks makes on the target, so we're multiplying together two things that we expect to be uncorrelated, and on average we get zero. So the result is that the expected squared error we get by picking a model at random is greater than the squared error we get by averaging the models, and it's greater by the variance of the outputs of the models. That's how much we win by when we take an average.

I want to show you that in a picture. Along the horizontal axis we have the possible values of the output, and in this case all of the different models predict a value that is too high. The predictors that are further than average from t make bigger-than-average squared errors, like that bad guy in red, and the predictors that are less than the average distance from t make smaller-than-average squared errors. And the first effect dominates, because we're using squared error. So if you look at the math, suppose the good guy and the bad guy are equally far from the mean, a distance epsilon on either side. Then the average of their two squared errors is the average of (y-bar minus t minus epsilon) squared and (y-bar minus t plus epsilon) squared, and when we work that out, we get (y-bar minus t) squared plus epsilon squared: the squared error that the mean of the predictors makes, plus an epsilon squared. So we win by averaging the predictors before we compare them with the target.

That's not always true; it depends very much on using a squared error. If, for example, you have a whole bunch of clocks and you try to make them more accurate by averaging them all, that'll be a disaster. And it'll be a disaster because the noise you expect in clocks isn't Gaussian noise. What you expect is that many of them will be very slightly wrong, and a few of them will have stopped or will be wildly wrong. And if you average, you make sure they are all significantly wrong, which is not what you want.
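A minimal numpy sketch of the clock example (the readings are invented for illustration): most clocks are off by only a second or two, one has stopped, and the averaged reading is far worse than almost every individual clock.

import numpy as np

true_time = 1000.0                                        # "true" time in seconds, made up
clocks = np.array([998.5, 1001.0, 999.5, 1000.5, 0.0])    # the last clock has stopped
print(np.abs(clocks - true_time))      # individual errors: small, except for the stopped clock
print(abs(clocks.mean() - true_time))  # error of the averaged reading: about 200 seconds, for everyone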
The same thing applies to discrete distributions, like the ones we have over class labels. So suppose that we have two models, and one gives the correct label a probability of p_i, and the other gives the correct label a probability of p_j. Is it better to pick one model at random, or is it better to average those two probabilities and predict the average of p_i and p_j? If the measure we use is the log probability of getting the right answer, then the log of the average of p_i and p_j is going to be a better bet than the average of log p_i and log p_j. That's most easily seen in a diagram, because of the shape of the log function. The black curve is the log. On the horizontal axis I've drawn p_i and p_j, and the gold-colored line joins log p_i to log p_j. You can see that if we first average p_i and p_j, to get the value where the blue arrow is, and then we compute the log, we get the blue dot. Whereas if we first take the log of p_i, and separately take the log of p_j, and then average those two logs, we get the midpoint of the gold line, which is below the blue dot.

So to make this averaging be a big win, we want our predictors to differ by a lot, and there are many different ways to make them differ. You could just rely on a learning algorithm that doesn't work too well and gets stuck in a different local optimum each time. It's not a very intelligent thing to do, but it's worth a try. You could use lots of different kinds of models, including ones that are not neural networks. So it makes sense to try decision trees, Gaussian process models, support vector machines, and many other different kinds of model. I'm not explaining any of those in this course; in Andrew Ng's machine learning course on Coursera, you can learn about all those things.

If you really want to use a bunch of different neural-network models, you can make them different by using a different number of hidden layers, a different number of units per layer, or different types of unit: in some nets you could use rectified linear units, and in other nets you could use logistic units. You could use different types or strengths of weight penalty, so you might use early stopping for some nets, an L2 weight penalty for others, and an L1 weight penalty for others. And you could use different learning algorithms: for example, you could use full-batch learning for some and mini-batch learning for others, if your data set is small enough to allow that.
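As a rough sketch of how several of these variations might be combined, here is one possibility using scikit-learn's MLPClassifier as a stand-in for the nets; the library, the synthetic data, and the particular settings are choices made here for illustration, not anything from the lecture. The ensemble averages the predicted class probabilities, which is the kind of combination argued for above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data, just so the sketch runs end to end.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Nets that differ in depth, width, unit type, weight-penalty strength, and batch size.
configs = [
    dict(hidden_layer_sizes=(50,),        activation="relu",     alpha=1e-4, batch_size=32),
    dict(hidden_layer_sizes=(100, 50),    activation="logistic", alpha=1e-2, batch_size=32),
    dict(hidden_layer_sizes=(30, 30, 30), activation="relu",     alpha=1e-3,
         batch_size=len(X_train)),        # effectively full-batch
]

probs = []
for cfg in configs:
    net = MLPClassifier(max_iter=2000, random_state=0, **cfg)
    net.fit(X_train, y_train)
    probs.append(net.predict_proba(X_test))

# Combine by averaging the predicted probabilities (not their logs).
avg_prob = np.mean(probs, axis=0)
print("ensemble accuracy:", (avg_prob.argmax(axis=1) == y_test).mean())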
You can also make the models differ by training them on different training data. So, there's a method introduced by Leo Breiman called bagging, where you train different models on different subsets of the data, and you get those subsets by sampling the training set with replacement. So if we sample from a training set that has examples a, b, c, d, and e, we get five examples again, but with some missing and some duplicated, and we train one of our models on that particular training set. This is what's done in the method called random forests, which uses bagging with decision trees, and which Leo Breiman was also involved in inventing. When you train decision trees with bagging and then average them together, they work much better than single decision trees by themselves. In fact, the Kinect uses random forests to convert information about depth into information about where your body parts are.

We could use bagging with neural nets, but it's very expensive. If you wanted to train, say, twenty different neural nets this way, you'd have to make your twenty different training sets, and then it would take twenty times as long as training one net. That doesn't matter with decision trees because they're so fast to train. Also, at test time, you'd have to run those twenty different nets. Again, with decision trees that doesn't matter, because they're so fast to use at test time.

Another method for making the training data different is to train each model on the whole training set, but to weight the cases differently. So in boosting, we typically use a sequence of fairly low-capacity models, and we weight the training cases differently for each model: we up-weight the cases the previous model got wrong and down-weight the cases the previous model got right, so the next model in the sequence doesn't waste its time trying to model cases that are already handled correctly; it uses its resources to try to deal with the cases the other models are getting wrong. An early use of boosting was with neural nets for MNIST, back when computers were much slower. One of the big advantages was that it focused the computational resources on modeling the tricky cases, and didn't waste a lot of time going over the easy cases again and again.
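A minimal numpy/scikit-learn sketch of these two ways of varying the training data; the toy data, the choice of decision trees as the base learner, and the simple doubling/halving reweighting rule are illustrative choices made here, not a specific published algorithm.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # toy training set
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Bagging: each model is trained on a bootstrap sample of the training set,
# drawn with replacement, so some cases are missing and some are duplicated.
def bootstrap(X, y, rng):
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

bagged_trees = []
for _ in range(20):
    Xb, yb = bootstrap(X, y, rng)
    bagged_trees.append(DecisionTreeClassifier(max_depth=3).fit(Xb, yb))

# Bagged prediction: average the trees' predicted class probabilities.
bagged_prob = np.mean([tree.predict_proba(X) for tree in bagged_trees], axis=0)

# Boosting (schematic): every model sees the whole training set, but cases the
# previous model got wrong are up-weighted and cases it got right are down-weighted.
weights = np.full(len(X), 1.0 / len(X))
boosted_stumps = []
for _ in range(5):
    stump = DecisionTreeClassifier(max_depth=1)          # a fairly low-capacity model
    stump.fit(X, y, sample_weight=weights)
    wrong = stump.predict(X) != y
    weights *= np.where(wrong, 2.0, 0.5)                 # up-weight errors, down-weight correct cases
    weights /= weights.sum()
    boosted_stumps.append(stump)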