In this video, I'm going to describe a new way of combining a very large number of neural network models without having to separately train a very large number of models. This is a method called dropout that's recently been very successful in winning competitions.

For each training case, we randomly omit some of the hidden units, so we end up with a different architecture for each training case. We can think of this as having a different model for every training case. And then the question is: how could we possibly train a model on only one training case, and how could we average all these models together efficiently at test time? The answer is that we use a great deal of weight sharing.

I want to start by describing two different ways of combining the outputs of multiple models. In a mixture, we combine models by averaging their output probabilities. So, if model A assigns probabilities of 0.3, 0.2 and 0.5 to three different answers, and model B assigns probabilities of 0.1, 0.8 and 0.1, the combined model simply assigns the averages of those probabilities.

A different way of combining models is to use a product of the probabilities. Here, we take a geometric mean of the same probabilities. So, model A and model B again assign the same probabilities as they did before. But now, what we do is multiply each pair of probabilities together and then take the square root. That's the geometric mean, and the geometric means will generally add up to less than one, so we have to divide by the sum of the geometric means to normalize the distribution so that it adds up to one again. You'll notice that in a product, a small probability output by one model has veto power over the other models.

Now I want to describe an efficient way to average a large number of neural nets that gives us an alternative to doing the correct Bayesian thing. The alternative probably doesn't work quite as well as doing the correct Bayesian thing, but it's much more practical.

So, consider the neural net with one hidden layer shown on the right. Each time we present a training example to it, what we're going to do is randomly omit each hidden unit with a probability of 0.5. So, we've crossed out three of the hidden units here, and we run the example through the net with those hidden units absent.
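To make the two combination rules concrete, here is a minimal sketch in NumPy (my own illustration, not code from the lecture) that combines the two example distributions by arithmetic averaging and by a normalized geometric mean.

```python
import numpy as np

# Output distributions of the two models over three possible answers.
p_a = np.array([0.3, 0.2, 0.5])
p_b = np.array([0.1, 0.8, 0.1])

# Mixture: arithmetic mean of the probabilities.
mixture = (p_a + p_b) / 2
print(mixture)              # [0.2 0.5 0.3]

# Product of experts: geometric mean, renormalized so it sums to one.
geo = np.sqrt(p_a * p_b)    # element-wise square root of the products
product = geo / geo.sum()
print(product)              # approximately [0.22 0.50 0.28]
```

With these particular numbers the two rules give fairly similar answers; the veto effect of the product becomes dramatic when one model assigns a probability very close to zero, since the geometric mean is then also close to zero.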
What this means is that we're randomly sampling from 2^h architectures, where h is the number of hidden units. That's a huge number of architectures. Of course, all of these architectures share weights; that is, whenever we use a hidden unit, it's got the same weights as it has in the other architectures.

So, we can think of dropout as a form of model averaging. We sample from these 2^h models, and most of the models will in fact never be sampled. A model that does get sampled only gets one training example. That's a very extreme form of bagging: the training sets are very different for the different models, but they're also very small.

The sharing of the weights between all the models means that each model is very strongly regularized by the others. And this is a much better regularizer than things like L2 or L1 penalties. Those penalties pull the weights towards zero. By sharing weights with the other models, a model gets regularized by something that tends to pull the weights towards the correct value.

The question still remains: what do we do at test time? We could sample many of the architectures, maybe a hundred, and take the geometric mean of their output distributions, but that would be a lot of work. There's something much simpler we can do: we use all of the hidden units, but we halve their outgoing weights, so they have the same expected effect as they did when we were sampling. It turns out that using all of the hidden units with half their outgoing weights exactly computes the geometric mean of the predictions that all 2^h models would have made, provided we're using a softmax output group.

If we have more than one hidden layer, we can simply use dropout at 0.5 in every layer. At test time, we halve all the outgoing weights of the hidden units, and that gives us what I call the "mean net": a net that has all of the units, but with the weights halved. When we have multiple hidden layers, this is not exactly the same as averaging lots of separate dropout models, but it's a good approximation, and it's fast. We could instead run lots of stochastic models with dropout and then average across those stochastic models. That would have one advantage over the mean net: it would give us an idea of the uncertainty in the answer.
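The exactness claim for a single hidden layer with a softmax output is easy to check numerically. Here is a small sketch (my own illustration; the layer sizes, logistic hidden units, and random weights are assumptions) that enumerates all 2^h dropout patterns, takes the normalized geometric mean of their softmax outputs, and compares it with the mean net that uses every hidden unit with halved outgoing weights.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A tiny net: 4 inputs, h = 3 hidden units, 2 output classes.
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

h = 1 / (1 + np.exp(-(x @ W1 + b1)))        # logistic hidden units

# Geometric mean of the softmax outputs over all 2^h dropout patterns,
# renormalized so it sums to one.
outputs = [softmax((h * np.array(mask)) @ W2 + b2)
           for mask in product([0, 1], repeat=3)]
geo = np.prod(outputs, axis=0) ** (1.0 / len(outputs))
geo /= geo.sum()

# The "mean net": all hidden units present, outgoing weights halved.
mean_net = softmax(h @ (W2 / 2) + b2)

print(np.allclose(geo, mean_net))           # True
```

The agreement is exact because the normalized geometric mean of softmax outputs equals the softmax of the averaged logits, and averaging the logits over all masks halves each hidden unit's contribution.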
What about the input layer? Well, we can use the same trick there, too. We use dropout on the inputs, but with a higher probability of keeping an input. This trick is already used in a system called denoising autoencoders, developed by Pascal Vincent, Hugo Larochelle and Yoshua Bengio at the University of Montreal, and it works very well.

So, how well does dropout work? Well, the record-breaking object recognition net developed by Alex Krizhevsky would have broken the record even without dropout, but with dropout it broke the record by a lot more. In general, if you have a deep neural net and it's overfitting, dropout will typically reduce the number of errors by quite a lot. I think any net that requires early stopping in order to prevent it overfitting would do better by using dropout. It would, of course, take longer to train, and it might mean using more hidden units. If you've got a deep neural net and it's not overfitting, you should probably be using a bigger one and using dropout, assuming you have enough computational power.

There's another way to think about dropout, which is how I originally arrived at the idea. You'll see it's a bit related to mixtures of experts, and to the question of what goes wrong when all the experts cooperate: what's preventing specialization?

If a hidden unit knows which other hidden units are present, it can co-adapt to the other hidden units on the training data. What that means is that the real signal training a hidden unit is: try to fix up the error that's left over when all the other hidden units have had their say. That's what's being backpropagated to train the weights of each hidden unit. Now, that's going to cause complex co-adaptations between the hidden units, and these are likely to go wrong when there's a change in the data. So, if you rely on a complex co-adaptation to get things right on the training data, it's quite likely not to work nearly so well on new test data.

It's like the idea that a big, complex conspiracy involving lots of people is almost certain to go wrong, because there are always things you didn't think of, and if a large number of people are involved, one of them will behave in an unexpected way, and then the others will be doing the wrong thing. If you want conspiracies, it's much better to have lots of little conspiracies. Then, when unexpected things happen, many of the little conspiracies will fail, but some of them will still succeed.
So, by using dropout, we force a hidden unit to work with combinatorially many other sets of hidden units. That makes it much more likely to do something that's individually useful, rather than something that's only useful because of the way particular other hidden units are collaborating with it. But it's also going to tend to do something that's individually useful and different from what the other hidden units do: it needs to be something that's marginally useful given what its co-workers tend to achieve. And I think this is what's giving nets with dropout their very good performance.
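To tie the whole recipe together, here is a compact sketch (my own illustration, not code from the lecture) of a stochastic training-time forward pass with dropout on the inputs and hidden layers, alongside the deterministic mean net used at test time. The keep probabilities, ReLU hidden units, and parameter layout are illustrative assumptions; the lecture only specifies a drop probability of 0.5 for hidden units and a higher keep probability for the inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

KEEP_INPUT = 0.8    # assumed value; the lecture just says "higher" than 0.5
KEEP_HIDDEN = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_forward(x, layers):
    """One stochastic pass: randomly omit inputs and hidden units."""
    x = x * (rng.random(x.shape) < KEEP_INPUT)              # dropout on the inputs
    for i, (W, b) in enumerate(layers):
        pre = x @ W + b
        if i < len(layers) - 1:                              # hidden layer
            x = np.maximum(0.0, pre)                         # assumed ReLU units
            x = x * (rng.random(x.shape) < KEEP_HIDDEN)      # dropout on hidden units
        else:                                                # softmax output group
            x = softmax(pre)
    return x

def mean_net_forward(x, layers):
    """The 'mean net': use every unit, but scale each weight matrix by the
    keep probability of the units feeding into it (0.5 for hidden units)."""
    keep = KEEP_INPUT
    for i, (W, b) in enumerate(layers):
        pre = x @ (W * keep) + b
        if i < len(layers) - 1:
            x = np.maximum(0.0, pre)
            keep = KEEP_HIDDEN
        else:
            x = softmax(pre)
    return x
```

Running `train_forward` many times and averaging the resulting output distributions corresponds to the stochastic alternative mentioned earlier, which is slower than the mean net but gives an idea of the uncertainty in the answer.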