In this video, I'm going to describe a new way of combining a very large number of neural network models without having to separately train a very large number of models. This is a method called dropout that has recently been very successful in winning competitions. For each training case, we randomly omit some of the hidden units, so we end up with a different architecture for each training case. We can think of this as having a different model for every training case. The questions, then, are how we could possibly train a model on only one training case, and how we could average all these models together efficiently at test time. The answer is that we use a great deal of weight sharing.

I want to start by describing two different ways of combining the outputs of multiple models. In a mixture, we combine models by averaging their output probabilities. So, if model A assigns probabilities of 0.3, 0.2 and 0.5 to three different answers, and model B assigns probabilities of 0.1, 0.8 and 0.1, the combined model simply assigns the averages of those probabilities. A different way of combining models is to use a product of the probabilities. Here, we take a geometric mean of the same probabilities. So, model A and model B again assign the same probabilities as they did before, but now we multiply each pair of probabilities together and then take the square root. That's the geometric mean, and the geometric means will generally add up to less than one, so we have to divide by the sum of the geometric means to normalize the distribution so that it adds up to one again. You'll notice that in a product, a small probability output by one model has veto power over the other models. Both combination rules are shown in the short sketch below.

Now I want to describe an efficient way to average a large number of neural nets that gives us an alternative to doing the correct Bayesian thing. The alternative probably doesn't work quite as well as doing the correct Bayesian thing, but it's much more practical. So, consider the neural net with one hidden layer, shown on the right. Each time we present a training example to it, we randomly omit each hidden unit with a probability of 0.5. So, we've crossed out three of the hidden units here, and we run the example through the net with those hidden units absent. What this means is that we're randomly sampling from two to the h architectures, where h is the number of hidden units. That's a huge number of architectures. Of course, all of these architectures share weights. That is, whenever we use a hidden unit, it has the same weights as it has in the other architectures.

So, we can think of dropout as a form of model averaging. We sample from these two to the h models. Most of the models, in fact, will never be sampled, and a model that is sampled typically only gets one training example. That's a very extreme form of bagging: the training sets are very different for the different models, but they're also very small. The sharing of the weights between all the models means that each model is very strongly regularized by the others. And this is a much better regularizer than things like L2 or L1 penalties, which pull the weights toward zero. By sharing weights with other models, a model gets regularized by something that tends to pull the weights towards the correct value. The question still remains what we do at test time. We could sample many of the architectures, maybe a hundred, and take the geometric mean of their output distributions, but that would be a lot of work.
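As a concrete illustration of the two combination rules described above, here is a minimal numpy sketch (the array names are just for this example) that reproduces the mixture and the normalized geometric mean for the probabilities used in the example.

```python
import numpy as np

# Output distributions of the two models from the example above.
p_a = np.array([0.3, 0.2, 0.5])
p_b = np.array([0.1, 0.8, 0.1])

# Mixture: arithmetic mean of the output probabilities.
mixture = (p_a + p_b) / 2              # -> [0.2, 0.5, 0.3]

# Product: geometric mean of the probabilities, renormalized to sum to one.
geo = np.sqrt(p_a * p_b)               # adds up to less than one
product = geo / geo.sum()              # -> roughly [0.22, 0.50, 0.28]

print("mixture:", mixture)
print("product:", product)
```

Notice the veto effect: if either model assigned a probability near zero to an answer, the product would also be near zero for that answer, no matter what the other model said.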
There's something much simpler we can do than sampling lots of architectures. We use all of the hidden units, but we halve their outgoing weights, so that they have the same expected effect as they did when we were sampling. It turns out that using all of the hidden units with half their outgoing weights exactly computes the geometric mean of the predictions that all two to the h models would have made, provided we're using a softmax output group. If we have more than one hidden layer, we can simply use dropout of 0.5 in every layer. At test time, we halve all the outgoing weights of the hidden units, and that gives us what I call the mean net. So, we use a net that has all of the units, but the weights are halved. When we have multiple hidden layers, this is not exactly the same as averaging all the separate dropout models, but it's a good approximation and it's fast. We could instead run lots of stochastic models with dropout and then average across those stochastic models. That would have one advantage over the mean net: it would give us an idea of the uncertainty in the answer. There's a small code sketch of this train-time and test-time behaviour below.

What about the input layer? Well, we can use the same trick there, too. We use dropout on the inputs, but we use a higher probability of keeping an input. This trick is already in use in a system called denoising autoencoders, developed by Pascal Vincent, Hugo Larochelle and Yoshua Bengio at the University of Montreal, and it works very well.

So, how well does dropout work? Well, the record-breaking object recognition net developed by Alex Krizhevsky would have broken the record even without dropout, but it broke it by a lot more by using dropout. In general, if you have a deep neural net and it's overfitting, dropout will typically reduce the number of errors by quite a lot. I think any net that requires early stopping in order to prevent it overfitting would do better by using dropout. It would, of course, take longer to train, and it might need more hidden units. If you've got a deep neural net and it's not overfitting, you should probably be using a bigger one and using dropout, assuming you have enough computational power.

There's another way to think about dropout, which is how I originally arrived at the idea. You'll see it's a bit related to mixtures of experts and what goes wrong when all the experts cooperate: what's preventing specialization? If a hidden unit knows which other hidden units are present, it can co-adapt to the other hidden units on the training data. What that means is that the real signal training a hidden unit is: try to fix up the error that's left over when all the other hidden units have had their say. That's what's being backpropagated to train the weights of each hidden unit. Now, that's going to cause complex co-adaptations between the hidden units, and these are likely to go wrong when there's a change in the data. If you rely on a complex co-adaptation to get things right on the training data, it's quite likely not to work nearly so well on new test data. It's like the idea that a big, complex conspiracy involving lots of people is almost certain to go wrong, because there are always things you didn't think of, and if there's a large number of people involved, one of them will behave in an unexpected way, and then the others will be doing the wrong thing. It's much better, if you want conspiracies, to have lots of little conspiracies. Then, when unexpected things happen, many of the little conspiracies will fail, but some of them will still succeed.
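Going back to the train-time and test-time procedure described above, here is a rough sketch of a single hidden layer with dropout. The function name, the sizes and the logistic nonlinearity are just placeholder assumptions for illustration: at training time each unit is kept with probability 0.5, and at test time every unit is kept but its activation is scaled by 0.5, which affects the next layer the same way as halving the outgoing weights in the mean net.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_layer(x, W, b, train=True, p_keep=0.5):
    """One hidden layer with dropout (illustrative sketch only).

    Training: each hidden unit is kept with probability p_keep, which
    samples one of the 2**h architectures that all share W and b.
    Test: every unit is kept but its activation is scaled by p_keep,
    which has the same effect on the next layer as halving the
    outgoing weights in the "mean net".
    """
    h = 1.0 / (1.0 + np.exp(-(x @ W + b)))   # logistic hidden units (placeholder choice)
    if train:
        mask = (rng.random(h.shape) < p_keep).astype(h.dtype)
        return h * mask                      # randomly omit roughly half the hidden units
    return h * p_keep                        # deterministic mean-net approximation

# Tiny usage example with made-up sizes.
x = rng.standard_normal((4, 10))             # 4 training cases, 10 inputs
W = rng.standard_normal((10, 20))            # 20 hidden units
b = np.zeros(20)
h_train = hidden_layer(x, W, b, train=True)  # a different architecture each pass
h_test  = hidden_layer(x, W, b, train=False) # all units, with halved expected effect
```

The same pattern extends to the input layer by using a higher p_keep there, as in denoising autoencoders.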
So, by using dropout, we force a hidden unit to work with combinatorially many other sets of hidden units. That makes it much more likely to do something that's individually useful, rather than something that's only useful because of the way particular other hidden units are collaborating with it. But it's also going to tend to do something that's individually useful and different from what the other hidden units do. It needs to do something that's marginally useful, given what its co-workers tend to achieve. And I think this is what gives nets with dropout their very good performance.