In this video, I'm going to talk about the mixture of experts model that was developed in the early 1990s. The idea of this model is to train a number of neural nets, each of which specializes in a different part of the data. That is, we assume we have a data set which comes from a number of different regimes, and we train a system in which one neural net will specialize in each regime, and a managing neural net will look at the input data and decide which specialist to give it to.

This kind of system doesn't make very efficient use of data, because the data is fractionated over all these different experts. And so with small data sets, it can't be expected to do very well. But as data sets get bigger, this kind of system may well come into its own, because it can make very good use of extremely large data sets.

In boosting, the weights on the models are not all equal, but after we finish training, each model has the same weight for every test case. We don't make the weights on the individual models depend on which particular case we're dealing with. In mixture of experts, we do. So the idea is that we can look at the input data for a particular case, during both training and testing, to help us decide which model we can rely on. During training, this will allow models to specialize on a subset of the cases. They then will not learn on cases for which they're not picked, so they can ignore stuff they're not good at modeling. This will lead to individual models that are very good at some things and very bad at other things.

The key idea is to make each model, or expert as we call it, focus on predicting the right answer for cases where it's already doing better than the other experts. That will cause specialization.

So there's a spectrum of models from very local models to very global models. Nearest neighbors, for example, is a very local model. To fit it, you just store the training cases. So that's really simple. And then if you have to predict y from x, you simply find the stored value of x that's closest to the test value of x, and then you predict the value of y that's the same as for that stored value. The result of that is that the curve relating the input to the output consists of lots of horizontal lines connected by cliffs. It would clearly make more sense to smooth things out a bit.
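To make that concrete, here's a minimal sketch (my own example, not from the lecture) of one-nearest-neighbor regression. "Fitting" is just storing the training cases, and prediction copies the y of the closest stored x, which is exactly what produces the piecewise-constant curve with cliffs.

```python
import numpy as np

# Hypothetical training cases for a very local model (1-nearest-neighbor).
train_x = np.array([0.0, 1.0, 2.0, 3.0])
train_y = np.array([0.5, 1.7, 1.9, 3.2])

def predict(x):
    # Find the stored x closest to the test x and return its stored y.
    nearest = np.argmin(np.abs(train_x - x))
    return train_y[nearest]

# The predicted curve is flat between training points and jumps ("cliffs")
# halfway between neighboring stored x values.
print(predict(1.4))  # -> 1.7
print(predict(1.6))  # -> 1.9
```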
At the other extreme, we have fully global models, like fitting one polynomial to all the data. They're much harder to fit to data, and they may also be unstable. That is, small changes in the data may cause big changes in the model you fit. That's because each parameter depends on all the data. In between these two ends of the spectrum, we have multiple local models that are of intermediate complexity. This is good if the data set contains several different regimes, and those different regimes have different input-output relationships. In financial data, for example, the state of the economy has a big effect on determining the mapping between inputs and outputs, and you might want to have different models for different states of the economy. But you might not know in advance how to decide what constitutes different states of the economy; you're going to have to learn that too.

So if we're going to use different models for different regimes, we have this problem of how to partition the data set into these different regimes. In order to fit different models to different regimes, we need to cluster the training data into subsets, one for each of these regimes. But we don't want to cluster the data based on the similarity of input vectors. What we're really interested in is the similarity of input-output mappings. So if you look at the case on the right, there are four data points that are nicely fitted by the red parabola, and another four data points that are nicely fitted by the green parabola. If you partition the data based on the input-output mapping, that is, based on the idea that a parabola will fit the data nicely, then you partition the data where the brown line is. If, however, you partitioned the data by just clustering the inputs, you'd partition where the blue line is, and then if you looked to the left of that blue line, you'd be stuck with a subset of data that can't be modeled nicely by a simple model.

So I'm going to explain an error function that encourages models to cooperate, and then I'm going to explain an error function that encourages models to specialize. And I'm going to try to give you a good intuition for why these two different error functions have these very different effects.
So if you want to encourage cooperation, what you should do is compare the average of the predictors with the target, and train all the predictors together to reduce the difference between the target and their average. So, using angle brackets to denote averaging, the error would be the difference between the target and the average, over all the predictors, of what they predict.

That will overfit badly. It makes the model much more powerful than training each predictor separately, because the models will learn to fix up the errors that the other models make.

So if you're averaging models during training, and training so that the average works nicely, you have to consider cases like this. On the right, we have the average of all the models except for model i. So that's what everybody else is saying when their votes are averaged together. On the left, we have the output of model i. Now, if we'd like the overall average to be closer to the target, what do we have to do to the output of the i-th model? We have to move it away from the target. That will take the overall average towards the target. You can see that what's happening is that model i is learning to compensate for the errors made by all the other models. But do we really want to move model i in the wrong direction? Intuitively, it seems better to move model i towards the target.

So here's an error function that encourages specialization, and it's not very different. To encourage specialization, we compare the output of each model with the target separately. We also need to use a manager to determine the weight we put on each of these models, which we can think of as the probability of picking each model if we have to pick one. So now our error is the expectation, over all the different models, of the squared error made by that model times the probability of picking that model, where the manager, or gating network, determines that probability by looking at the input for this particular case. What will happen if you try to minimize this error is that most of the experts will end up ignoring most of the targets. Each expert will only deal with a small subset of the training cases, and it will learn to do very well on that small subset.
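As a rough sketch (my own variable names, not the lecture's), here is how the two error functions differ for a single training case with target t and expert outputs y:

```python
import numpy as np

t = 1.0                           # target for one training case
y = np.array([0.9, 1.4, 0.2])     # outputs of three experts
p = np.array([0.7, 0.2, 0.1])     # manager's probabilities, summing to one

# Cooperative error: compare the *average* prediction with the target,
# so every expert gets pushed to fix up the others' mistakes.
e_coop = (t - y.mean()) ** 2

# Specialization error: the expected squared error under the manager's
# probability of picking each expert, so each expert is compared with the
# target separately and weighted by how likely it is to be picked.
e_spec = np.sum(p * (t - y) ** 2)
```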
So here's a picture of the mixture of experts architecture. Our cost function is the squared difference between the output of each expert and the target, averaged over all the experts, but with the weights in that average determined by the manager. There's actually a better cost function, which we'll come to later, based on a mixture model. But this was the cost function I first thought of, and I think it's easier to explain the intuition with this cost function.

So we have an input. Our different experts all look at that input and make their predictions based on it. In addition, we have a manager. The manager might have multiple layers, and its last layer is a softmax layer, so the manager outputs as many probabilities as there are experts. Using the outputs of the manager and the outputs of the experts, we can then compute the value of that error function.

Now let's look at the derivatives of that error function. The outputs of the manager are determined by the inputs x_i to the softmax group in the final layer of the manager. The error is then determined by the outputs of the experts and also by the probabilities output by the manager. If we differentiate that error with respect to the output of an expert, we get a signal for training that expert, and the gradient we get with respect to the output of an expert is just the probability of picking that expert times the difference between what that expert says and the target. So if the manager decides that there's a very low probability of picking that expert for that particular training case, the expert will get a very small gradient, and the parameters inside that expert won't get disturbed by that training case. It'll be able to save its parameters for modeling the training cases where the manager gives it a big probability.

We can also differentiate with respect to the outputs of the gating network. Actually, what we're going to do is differentiate with respect to the quantity that goes into the softmax. That's called the logit; that's x_i. If we take the derivative with respect to x_i, we get the probability that that expert was picked, times the difference between the squared error made by that expert and the average squared error over all the experts, using the weighting provided by the manager. So what that means is that if expert i makes a lower squared error than the average of the other experts, we'll try to raise the probability of expert i. But if expert i makes a higher squared error than the other experts, we'll try to lower its probability. That's what causes specialization.
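Here is a hedged sketch of those derivatives for one training case, again with my own variable names; E is the specialization error from before, and p comes from the manager's softmax over its logits x:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

t = 1.0                            # target for one training case
y = np.array([0.9, 1.4, 0.2])      # expert outputs
x = np.array([2.0, 0.5, -1.0])     # manager logits
p = softmax(x)                     # manager probabilities

E = np.sum(p * (t - y) ** 2)       # specialization error

# Gradient w.r.t. each expert's output: proportional to the probability of
# picking that expert times (output - target), so an expert the manager
# rarely picks gets a tiny gradient and keeps its parameters undisturbed.
dE_dy = 2 * p * (y - t)

# Gradient w.r.t. the manager's logits: p_i * ((t - y_i)^2 - E). Gradient
# descent therefore raises p_i for experts whose squared error beats the
# manager-weighted average, and lowers it for experts that do worse.
dE_dx = p * ((t - y) ** 2 - E)
```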
Now, there's actually a better cost function. It's just more complicated. It depends on mixture models, which I haven't explained in this course. Again, those are well explained in Andrew Ng's course. I did explain, however, the interpretation of maximum likelihood, when you're doing regression, as the idea that the network is actually making a Gaussian prediction. That is, the network outputs a particular value, say y1, and we think of it as making bets about what the target value might be that form a Gaussian distribution around y1 with unit variance.

So the red expert makes a Gaussian distribution of predictions around y1, and the green expert makes a Gaussian distribution of predictions around y2. The manager then decides probabilities for the two experts, and those probabilities are used to scale down the Gaussians. Those probabilities have to add to one, and they're called mixing proportions. Once we scale down the Gaussians, we get a distribution that's no longer a Gaussian; it's the sum of the scaled-down red Gaussian and the scaled-down green Gaussian. And that's the predictive distribution from the mixture of experts. What we want to do now is maximize the log probability of the target value under that black curve, and remember, the black curve is just the sum of the red curve and the green curve.

So that leads to the following model for the probability of the target, given the mixture of experts. The probability is on the left, and it's the sum over all the experts of the mixing proportion assigned to that expert by the manager, or gating network, times e to the minus one half of the squared difference between the target and the output of that expert, scaled by the normalization term for a Gaussian with a variance of one. And so our cost function is simply going to be the negative log of that probability on the left. We're going to try and minimize the negative log of that probability.
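As a sketch of that mixture cost, under the assumption of unit-variance Gaussians centered on each expert's output and using my own variable names:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

t = 1.0                                    # target for one training case
y = np.array([0.9, 1.4, 0.2])              # expert outputs (Gaussian means)
p = softmax(np.array([2.0, 0.5, -1.0]))    # mixing proportions from the manager

# Predictive density of the target under the mixture:
# sum over experts of p_i * N(t | y_i, variance 1).
density = np.sum(p * np.exp(-0.5 * (t - y) ** 2) / np.sqrt(2 * np.pi))

# The cost to minimize is the negative log of that probability density.
cost = -np.log(density)
```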