In this video, I'm going to talk about the reason why we want to combine many models when we're making predictions. If we have a single model, we have to choose some capacity for it. If we choose too little capacity, it won't be able to fit the regularities in the training data. And if we choose too much capacity, it will fit the sampling error in the particular training set we have. By using many models, we can get a better tradeoff between fitting the true regularities and overfitting the sampling error in the data.

At the start of the video, I'll show you that when you average models together, you can expect to do better than any single model. This effect is largest when the models make very different predictions from each other. And at the end of the video, I'll discuss various ways in which we can encourage the different models to make very different predictions.

As we've seen before, when we have a limited amount of training data, we tend to get overfitting. If we average the predictions of many different models, we can typically reduce that overfitting. This helps most when the models make very different predictions from one another.

For regression, the squared error can be decomposed into a bias term and a variance term, and that allows us to analyze what's going on. The bias term is big if the model has too little capacity to fit the data; it measures how poorly the model approximates the true function. The variance term is big if the model has so much capacity that it's good at modeling the sampling error in our particular training set. It's called variance because if we go and get another training set of the same size from the same distribution, our model will fit that training set differently, because it has different sampling error. So we get variance in the way the models fit to different training sets.

If we average models together, what we're doing is averaging away the variance. That allows us to use individual models that have high capacity and therefore high variance. These high-capacity models typically have low bias. So we can get the low bias without incurring the high variance, by using averaging to get rid of the variance.
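You can see that decomposition empirically. Here's a minimal sketch, not from the lecture (the sine target, the noise level, and the use of polynomial fits are all my own choices), that refits models of two different capacities to many fresh training sets and measures the bias and variance of their predictions at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)                 # the true regularity we want to fit

x_train = np.linspace(0, 3, 10)      # inputs for a small training set
x_test = 1.5                         # a single test input

for degree in (1, 7):                # low capacity vs. high capacity
    preds = []
    for _ in range(500):             # many training sets of the same size
        # each training set has its own sampling error
        y_train = true_f(x_train) + rng.normal(0.0, 0.3, x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x_test)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.5f}, variance = {variance:.5f}")
```

The low-capacity model comes out with high bias and low variance; the high-capacity model shows the reverse, and averaging its predictions across the 500 training sets would remove most of its error.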
So now let's try to analyze how an individual model compares with an average of models. On any one test case, some individual predictors may be better than the combined predictor, and different individual predictors will be better on different cases. But if the individual predictors disagree a lot, the combined predictor is typically better than all of the individual predictors when we average over test cases. So we should aim to make the individual predictors disagree, without making them poor predictors. The art is to have individual predictors that make very different errors from one another, but are each fairly accurate.

So now let's look at the math of what happens when we combine networks. We're going to compare two expected squared errors. The first is the one we get if we pick one of the predictors at random and use it for making our predictions; we then average, over all the predictors, the error we'd expect to get if we followed that policy. So y bar is the average of what all the predictors say, and y_i is what an individual predictor says.

That is, y bar is just the expectation, over all the individual predictors i, of y_i. I'm using angle brackets to represent an expectation, where the subscript on the bracket tells you what it's an expectation over. We can write the same thing as 1/N times the sum, over all N predictors, of y_i: y bar = ⟨y_i⟩_i = (1/N) Σ_i y_i.

Now, if we look at the expected squared error we'd get if we chose a predictor at random, what we have to do is compare that predictor with the target, take the squared difference, and then average that over all predictors: ⟨(t − y_i)²⟩_i. That's the left-hand side. If I simply add a y bar and subtract a y bar, I don't change the value, but it becomes easier to do some manipulations: (t − y_i)² = ((t − y bar) + (y bar − y_i))². Multiplying that out, inside the expectation bracket I have (t − y bar)², plus (y_i − y bar)², minus twice (t − y bar)(y_i − y bar), and that cross term is going to disappear.

The first term, (t − y bar)², doesn't have an i in it anymore, so we can drop the expectation brackets for it. It really is just (t − y bar)², and that's the squared error you'd get if you compared the average of the models with the target. Our aim is to show that the thing on the left-hand side is bigger than that, i.e., that by using the average, we've reduced the expected squared error. The extra term we have on the right-hand side is the expectation of (y_i − y bar)², and that's just the variance of the y_i: the expected squared difference between y_i and y bar. And the last term disappears, because (t − y bar) doesn't depend on i, and the average of (y_i − y bar) over all the predictors is zero by the definition of y bar, so the cross term vanishes.

So the result is that the expected squared error we get by picking a model at random exceeds the squared error we get by averaging the models, by exactly the variance of the outputs of the models: ⟨(t − y_i)²⟩_i = (t − y bar)² + ⟨(y_i − y bar)²⟩_i. That's how much we win by when we take an average.
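Here's a minimal numerical check of that identity; it's a sketch, not from the lecture, and the Gaussian spread of the predictors is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 1.0                                  # the target
y = t + rng.normal(0.5, 0.2, 1000)       # predictors that are all a bit too high

y_bar = y.mean()                         # the combined (averaged) predictor
pick_at_random = np.mean((t - y) ** 2)   # expected error if we pick a predictor at random
use_average = (t - y_bar) ** 2           # error if we use the averaged predictor
variance = np.mean((y - y_bar) ** 2)     # variance of the predictors' outputs

# the two policies differ by exactly the variance (up to float rounding)
print(pick_at_random, use_average + variance)
```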
Now I want to show you that in a picture. Along the horizontal axis we have the possible values of the output, and in this case all of the different models predict a value that is too high. The predictors that are further than average from t make bigger-than-average squared errors, like that bad guy in red, and the predictors that are less than the average distance from t make smaller-than-average squared errors. And the first effect dominates, because we're using squared error. To see that in the math, suppose the good guy and the bad guy are equally far, ε, from the mean. The average squared error they make is ½[((y bar − t) − ε)² + ((y bar − t) + ε)²], and when we work that out, we get (y bar − t)² + ε²: the squared error that the mean of the predictors makes, plus an ε². So we win by averaging the predictors before we compare them with the target.

That's not always true; it depends very much on using a squared error. If, for example, you have a whole bunch of clocks and you try to make them more accurate by averaging them all, that'll be a disaster. It'll be a disaster because the noise you expect in clocks isn't Gaussian noise. What you expect is that many of them will be very slightly wrong and a few of them will have stopped or will be wildly wrong. And if you average, you make sure they are all significantly wrong, which is not what you want.

The same kind of thing applies to discrete distributions, like the probabilities we assign to class labels. So suppose we have two models, and one gives the correct label a probability of p_i, and the other gives the correct label a probability of p_j. Is it better to pick one model at random, or is it better to average those two probabilities and predict (p_i + p_j)/2? If the measure we use is the log probability of getting the right answer, then the log of the average of p_i and p_j is a better bet than the average of log p_i and log p_j. For example, if p_i = 0.1 and p_j = 0.9, then log((0.1 + 0.9)/2) = log 0.5 ≈ −0.69, whereas (log 0.1 + log 0.9)/2 ≈ −1.20.

That's most easily seen in a diagram, because of the shape of the log function. The black curve is the log. On the horizontal axis I've drawn p_i and p_j, and the gold line joins log p_i to log p_j. You can see that if we first average p_i and p_j, to get the value at the blue arrow, and then compute the log, we get the blue dot. Whereas if we first take the log of p_i, and separately take the log of p_j, and then average those two logs, we get the midpoint of the gold line, which is below the blue dot.

So to make this averaging a big win, we want our predictors to differ a lot, and there are many different ways to make them differ. You could just rely on a learning algorithm that doesn't work too well and gets stuck in a different local optimum each time. It's not a very intelligent thing to do, but it's worth a try. You could use lots of different kinds of models, including ones that are not neural networks. So it makes sense to try decision trees, Gaussian process models, support vector machines. I'm not explaining any of those in this course; in Andrew Ng's machine learning course on Coursera, you can learn about those things. Or you could try many other different kinds of model.

If you really want to use a bunch of different neural-network models, you can make them differ by using different numbers of hidden layers, different numbers of units per layer, or different types of unit: in some nets you could use rectified linear units, and in other nets logistic units. You could use different types or strengths of weight penalty, so you might use early stopping for some nets, an L2 weight penalty for others, and an L1 weight penalty for others still. And you could use different learning algorithms: for example, full-batch learning for some nets and mini-batch learning for others, if your data set is small enough to allow that.

You can also make the models differ by training them on different training data. There's a method introduced by Leo Breiman called bagging, where you train different models on different subsets of the data, and you get these subsets by sampling the training set with replacement. So if we sample from a training set that has examples a, b, c, d and e, we get five examples, but some of them will be missing and some will be duplicated, and we train one of our models on that particular resampled training set. This is what's done in a method called random forests, which uses bagging with decision trees, and which Leo Breiman was also involved in inventing. When you train decision trees with bagging and then average them together, they work much better than single decision trees by themselves. In fact, the Kinect uses random forests to convert information about depth into information about where your body parts are.
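Here's a minimal bagging sketch, not from the lecture; the use of scikit-learn and of regression trees as the base model are my own assumptions, with the trees standing in for any high-variance predictor:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, (100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 100)    # noisy training data

models = []
for _ in range(20):
    # sample the training set with replacement: some cases missing, some duplicated
    idx = rng.integers(0, len(X), len(X))
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# the bagged prediction is the average of the individual models' predictions
X_test = np.linspace(0, 3, 50).reshape(-1, 1)
bagged = np.mean([m.predict(X_test) for m in models], axis=0)
```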
We could use bagging with neural nets, but it's very expensive. If you wanted to train, say, twenty different neural nets this way, you'd have to make your twenty different training sets, and then it would take twenty times as long as training one net. That doesn't matter with decision trees, because they're so fast to train. Also, at test time, you'd have to run all twenty different nets. Again, with decision trees that doesn't matter, because they're so fast to use at test time.

Another method for making the training data different is to train each model on the whole training set, but to weight the cases differently. So in boosting, we typically use a sequence of fairly low-capacity models, and we weight the training cases differently for each model: we up-weight the cases the previous models got wrong, and we down-weight the cases the previous models got right. So the next model in the sequence doesn't waste its time trying to model cases that are already handled correctly; it uses its resources to try to deal with the cases the other models are getting wrong.

An early use of boosting was with neural nets for MNIST, back when computers were much slower. One of the big advantages was that it focused the computational resources on modeling the tricky cases, and didn't waste a lot of time going over the easy cases again and again.
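The lecture describes boosting only at this level of generality, so the particular reweighting rule in this sketch is an assumption on my part; it follows AdaBoost, with depth-one decision stumps standing in for the low-capacity models:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # labels in {-1, +1}

w = np.full(len(X), 1.0 / len(X))               # start with uniform case weights
models, alphas = [], []
for _ in range(10):                             # a sequence of low-capacity models
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                    # weighted error of this model
    alpha = 0.5 * np.log((1.0 - err) / (err + 1e-12))
    w *= np.exp(-alpha * y * pred)              # up-weight mistakes, down-weight correct cases
    w /= w.sum()
    models.append(stump)
    alphas.append(alpha)

# the combined predictor is a weighted vote over the whole sequence
combined = np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
print("training accuracy:", (combined == y).mean())
```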