In this video, I'm going to talk about the mixture of experts model, which was developed in the early 1990s. The idea of this model is to train a number of neural nets, each of which specializes in a different part of the data. That is, we assume we have a data set that comes from a number of different regimes, and we train a system in which one neural net specializes in each regime, and a managing neural net looks at the input data and decides which specialist to give it to. This kind of system doesn't make very efficient use of the data, because the data is fractionated over all these different experts. And so with small data sets, it can't be expected to do very well. But as data sets get bigger, this kind of system may well come into its own, because it can make very good use of extremely large data sets.

In boosting, the weights on the models are not all equal, but after we finish training, each model has the same weight for every test case. We don't make the weights on the individual models depend on which particular case we're dealing with. In mixture of experts, we do. So the idea is that we can look at the input data for a particular case, during both training and testing, to help us decide which model we can rely on. During training, this will allow models to specialize on a subset of the cases. They will then not learn on cases for which they're not picked, so they can ignore stuff they're not good at modeling. This will lead to individual models that are very good at some things and very bad at other things. The key idea is to make each model, or expert as we call it, focus on predicting the right answer for the cases where it's already doing better than the other experts. That will cause specialization.

So there's a spectrum of models, from very local models to very global models. Nearest neighbors, for example, is a very local model. To fit it, you just store the training cases. So that's really simple, and then if you have to predict y from x, you simply find the stored value of x that's closest to the test value of x, and then you predict the value of y that goes with that stored value. The result of that is that the curve relating the input to the output consists of lots of horizontal lines connected by cliffs. It would clearly make more sense to smooth things out a bit. At the other extreme, we have fully global models, like fitting one polynomial to all the data. They're much harder to fit to data, and they may also be unstable. That is, small changes in the data may cause big changes in the model you fit. That's because each parameter depends on all the data.

In between these two ends of the spectrum, we have multiple local models that are of intermediate complexity. This is good if the data set contains several different regimes, and those different regimes have different input-output relationships. In financial data, for example, the state of the economy has a big effect on determining the mapping between inputs and outputs, and you might want to have different models for different states of the economy. But you might not know in advance how to decide what constitutes different states of the economy; you're going to have to learn that too. So, if we're going to use different models for different regimes, we have the problem of how to partition the data set into these different regimes. In order to fit different models to different regimes, we need to cluster the training data into subsets, one for each of these regimes.
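As an aside, here is a minimal sketch (my own toy code, not from the lecture) of the nearest-neighbors predictor mentioned earlier. It makes the "horizontal lines connected by cliffs" behavior concrete: fitting is just storing the training cases, and prediction copies the target of the closest stored input.

```python
import numpy as np

def fit_nearest_neighbour(x_train, y_train):
    # "Fitting" is just storing the training cases.
    return np.asarray(x_train, dtype=float), np.asarray(y_train, dtype=float)

def predict_nearest_neighbour(model, x_test):
    x_train, y_train = model
    x_test = np.atleast_1d(np.asarray(x_test, dtype=float))
    # For each test input, find the closest stored input and copy its target,
    # so the prediction is piecewise constant in x (flat segments joined by cliffs).
    nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

# Example: the prediction jumps as x crosses the midpoint between stored inputs.
model = fit_nearest_neighbour([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
print(predict_nearest_neighbour(model, [0.4, 0.6, 1.9]))   # -> [0. 1. 4.]
```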
Now, when we cluster the training data into these regimes, we don't want to cluster based on the similarity of the input vectors. All we're interested in is the similarity of the input-output mappings. So if you look at the case on the right, there are four data points that are nicely fitted by the red parabola and another four data points that are nicely fitted by the green parabola. If you partition the data based on the input-output mapping, that is, based on the idea that a parabola will fit the data nicely, then you partition the data where that brown line is. If, however, you partitioned the data by just clustering the inputs, you would partition where the blue line is, and then if you look to the left of that blue line, you're stuck with a subset of data that can't be modeled nicely by a simple model.

So I'm going to explain an error function that encourages models to cooperate, and then I'm going to explain an error function that encourages models to specialize, and I'm going to try to give you a good intuition for why these two different error functions have such different effects. If you want to encourage cooperation, what you do is compare the average of all the predictors with the target, and train all the predictors together to reduce the difference between the target and their average. So, using angle brackets to denote averaging over the models, the error is the squared difference between the target and the average of what all the predictors predict. That can overfit badly. It makes the model much more powerful than training each predictor separately, because the models learn to fix up the errors that the other models make.

So, if you're averaging models during training, and training so that the average works nicely, you have to consider cases like this. On the right, we have the average of all the models except for model i. So that's what everybody else is saying when their votes are averaged together. On the left, we have the output of model i. Now, if we'd like the overall average to be closer to the target, what do we have to do to the output of the i-th model? We have to move it away from the target. That will take the overall average towards the target. You can see that what's happening is that model i is learning to compensate for the errors made by all the other models. But do we really want to move model i in the wrong direction? Intuitively, it seems better to move model i towards the target.

So here is an error function that encourages specialization, and it's not very different. To encourage specialization, we compare the output of each model with the target separately. We also need a manager to determine the weight we put on each of these models, which we can think of as the probability of picking that model if we have to pick one. So now our error is the expectation, over all the different models, of the squared error made by that model times the probability of picking that model, where the manager, or gating network, determines that probability by looking at the input for this particular case. What will happen if you try to minimize this error is that most of the experts will end up ignoring most of the training cases. Each expert will only deal with a small subset of the training cases, and it will learn to do very well on that small subset.

So here's a picture of the mixture of experts architecture. Our cost function is the squared difference between the output of each expert and the target, averaged over all the experts, but with the weights in that average determined by the manager.
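Written out explicitly (my notation, not the slide's: y_i is the output of model i, t is the target, angle brackets average over models, and p_i is the manager's probability of picking model i), the two error functions are:

```latex
% Error that encourages cooperation: train everyone so that the average matches the target.
E_{\text{coop}} = \bigl(t - \langle y_i \rangle_i\bigr)^2

% Error that encourages specialization: compare each model with the target separately,
% weighted by the manager's probability of picking that model.
E_{\text{spec}} = \sum_i p_i \,(t - y_i)^2
```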
There's actually a better cost function, which we'll come to later, based on a mixture model. But this was the cost function I first thought of, and I think it's easier to explain the intuition with this cost function. So we have an input. Our different experts all look at that input and make their predictions based on it. In addition, we have a manager. The manager might have multiple layers, and its last layer is a softmax layer, so the manager outputs as many probabilities as there are experts. Using the outputs of the manager and the outputs of the experts, we can then compute the value of that error function.

Now let's look at the derivatives of that error function. The outputs of the manager are determined by the inputs x_i to the softmax group in the final layer of the manager, and the error is determined by the outputs of the experts and also by the probabilities output by the manager. If we differentiate the error with respect to the output of an expert, we get a signal for training that expert, and the gradient we get with respect to the output of an expert is just the probability of picking that expert times the difference between what that expert says and the target. So if the manager decides that there's a very low probability of picking that expert for that particular training case, the expert will get a very small gradient, and the parameters inside that expert won't get disturbed by that training case. It will be able to save its parameters for modeling the training cases where the manager gives it a big probability.

We can also differentiate with respect to the outputs of the gating network. Actually, what we're going to do is differentiate with respect to the quantity that goes into the softmax. That's called the logit; that's x_i. If we take the derivative with respect to x_i, we get the probability that that expert was picked, times the difference between the squared error made by that expert and the average of the squared errors over all the experts, using the weighting provided by the manager. So what that means is, if expert i makes a lower squared error than that weighted average, we'll try to raise the probability of expert i. But if expert i makes a higher squared error than the other experts, we'll try to lower its probability. That's what causes specialization. I'll give a small numerical sketch of these two gradients below.

Now, there's actually a better cost function. It's just more complicated. It depends on mixture models, which I haven't explained in this course. Again, those will be well explained in Andrew Ng's course. I did explain, however, the interpretation of maximum likelihood, when you're doing regression, as the idea that the network is actually making a Gaussian prediction. That is, the network outputs a particular value, say y1, and we think of it as making bets about what the target value might be, in the form of a Gaussian distribution around y1 with unit variance. So the red expert makes a Gaussian distribution of predictions around y1, and the green expert makes a Gaussian distribution of predictions around y2. The manager then decides probabilities for the two experts, and those probabilities are used to scale down the Gaussians. Those probabilities have to add to one, and they are called mixing proportions. Once we scale down the Gaussians, we get a distribution that's no longer Gaussian: it's the sum of the scaled-down red Gaussian and the scaled-down green Gaussian. And that's the predictive distribution of the mixture of experts.
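Before turning that predictive distribution into a cost function, here is the small numerical sketch of the simpler squared-error cost and its two gradients promised above. It is my own toy code, not code from the course; it assumes expert outputs y_i, manager logits x_i fed through a softmax to give p_i, and the error E = sum_i p_i (t - y_i)^2. The factor of 2 in dE/dy_i just comes from differentiating the square.

```python
import numpy as np

def moe_error_and_grads(y, logits, t):
    """Mixture-of-experts squared-error cost and its gradients for one training case.

    y      : outputs of the experts, shape (n_experts,)
    logits : inputs x_i to the manager's softmax, shape (n_experts,)
    t      : scalar target
    """
    # Manager probabilities p_i: softmax over the logits.
    p = np.exp(logits - logits.max())
    p /= p.sum()

    sq_err = (t - y) ** 2        # per-expert squared errors
    E = np.sum(p * sq_err)       # error = probability-weighted squared error

    # dE/dy_i = -2 p_i (t - y_i): low-probability experts get a tiny gradient,
    # so their parameters are left undisturbed by this case.
    dE_dy = -2.0 * p * (t - y)

    # dE/dx_i = p_i ((t - y_i)^2 - E): gradient descent raises p_i when expert i
    # beats the weighted average squared error, and lowers it otherwise.
    dE_dlogits = p * (sq_err - E)

    return E, dE_dy, dE_dlogits

# Tiny example with two experts; the second is closer to the target,
# so its logit is pushed up and the first expert's logit is pushed down.
E, dE_dy, dE_dlogits = moe_error_and_grads(
    y=np.array([0.0, 0.9]), logits=np.array([0.2, -0.1]), t=1.0)
print(E, dE_dy, dE_dlogits)
```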
Coming back to that predictive distribution: what we want to do now is maximize the log probability of the target value under that black curve, and remember, the black curve is just the sum of the scaled-down red and green curves. So that leads to the following model for the probability of the target, given the mixture of experts. The probability is on the left, and it's the sum over all the experts of the mixing proportion assigned to that expert by the manager or gating network, times e to the minus one half of the squared difference between the target and the output of that expert, scaled by the normalization term for a Gaussian with a variance of one. And so our cost function is simply going to be the negative log of that probability on the left: we're going to try and minimize the negative log of that probability.
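Written out in symbols (with y_i the output of expert i, p_i its mixing proportion, t the target, and unit-variance Gaussians as described), the predictive distribution and the cost are:

```latex
% Predictive distribution of the mixture of experts (unit-variance Gaussians):
p(t \mid \mathrm{MoE}) = \sum_i p_i \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}(t - y_i)^2}

% Cost: the negative log probability of the target under that mixture.
E = -\log p(t \mid \mathrm{MoE})
```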