In this video, I'll talk about how we can
control compacity by limiting the size of
the weights.
The standard way to do this is to
introduce a penalty that prevents the
weights from getting too large.
With the implicite assumption that a
network with small weights is somehow
simpler than a network with large weights.
There are several different penalty terms
we can use.
And it's also possible to constrain the
weights, so that the incoming weight
vector to each of the hidden units is not
allowed to be longer than a certain
length.
The standard weight can limit the size of
the weights, is to use an L2 weight
penalty, which means that we penalize the
squared value of the weight.
This is sometimes called weight decay in
the neural network literature, because the
derivative of that penalty acts like a
force pulling the weight towards zero.
This weight penalty keeps the weight
small, unless they have big urge
derivatives to counteract it.
So if you look at what the penalty term
looks like, as the weight moves away from
zero, you get this parabolic cost.
If you look at the equation, the cost that
you're optimizing is the normal error that
you're trying to reduce, plus a term which
is the sum of the squares of the weights,
with a coefficient in front, Lambda.
And we divide by two so that when we
differentiate the two is cancelled.
That coefficient in front of the sum
squared weights is sometimes called the
weight cost.
That determines how strong the penalty is.
If you differentiate, you can see the
derivative of this cost is just the
derivative of the error plus something
that's to do with the size of the weight
and the value of Lambda.
That derivative will be zero when the
magnitude of the weight is one over Lambda
times the magnitude of the derivative.
So the only way you can have big weights,
when you're at a minimum of the cost
function, is if they also have big error
derivatives.
And this makes the weights much easier to
interpret.
You don't have a whole lot of weights that
are large but aren't doing anything.
The effect of an L2 penalty on the weights
is to prevent the network from using
weights that it doesn't need.
This often improves generalization a lot,
because it can use those weights that it
doesn't really need to fit the sampling
error.
It also makes a smoother model in which
the output changes more slowly as the
input changes.
So if the network has two very similar
inputs, when you put in an L2 weight
penalty, it prefers to put half the weight
on each of those two similar inputs rather
than all of the weight on one, as shown on
the right here.
If the two inputs are very similar, those
two networks will produce very similar
outputs.
But the one with the half weights will
have much less extreme changes in its
outputs when you change the inputs.
There are other kinds of weight penalty.
For example, an L1 penalty, where the cost
function is just this v shape.
So here what we're doing is we're
penalizing the absolute values of the
weights.
This has the nice effect that it drives
many of the weights to be exactly zero and
that helps a lot in interpretation.
If there's only a few non zero weights
left, it's much easier to understand
what's going on.
We can also use weight penalties that are
more extreme than L1 where the gradient of
the cost function actually gets smaller
when the weight gets really big.
This allows the network to keep large
weights without them being pulled towards
zero.
It's just the small weights that'll get
pulled towards zero.
So we then even more like that getting it
with a few large weight.
Instead of putting penalties on the
weights, we could actually use weight
constraints.
What I mean by that is instead of
penalizing the squared value of each
weight separately, we put a constraint on
the maximum squared length of the incoming
weight vector of each hidden unit or
output unit.
When we update the weights, if the length
of that incoming vector gets longer than
allowed by the constraint, we simply scale
the vector down by dividing all the
weights by the same amount until its
length fits the allowed length.
Using weight constraints like this, has a
number of advantages over weight
penalties, and I found these work
necessary better.
It's much easier to select the sensible
value for the squared length of the
incoming weight factor than it is to
select the weight penalty.
That's because, logistic units.
Have, a natural scale to them so we know
what a weight of one means.
Using weight constraints also prevents
hidden units getting stuck near zero with
all their weights being tiny and not doing
anything useful.
Because when all their weights are tiny,
there's no constraint on the weights.
So there's nothing preventing them
growing.
Weight constraints also prevent the weight
from exploding.
One of the subtle things that weight
constraints do is that when a unit hits
its constraint, the effective penalty on
all of its weights is determined by the
big gradients.
So if some of the incoming weights have
very big gradients, they'll be trying to
push the length of the incoming weight
factor up.
And that will push down on all the other
weights.
So in effect, if you think of it like a
penalty, the penalty scales itself so as
to be appropriate for the big weights and
to suppress the small weights.
This is much more effective than a fixed
penalty of pushing irrelevant weights
towards zero.
For those of you who knows about La Grange
multipliers,
The penalties of in just the La Grange
multipliers required to keep the
constraints satisfied.