In this video, I'll talk about how we can control capacity by limiting the size of the weights. The standard way to do this is to introduce a penalty that prevents the weights from getting too large, with the implicit assumption that a network with small weights is somehow simpler than a network with large weights. There are several different penalty terms we can use, and it's also possible to constrain the weights so that the incoming weight vector to each of the hidden units is not allowed to be longer than a certain length.

The standard way to limit the size of the weights is to use an L2 weight penalty, which means that we penalize the squared value of the weights. This is sometimes called weight decay in the neural network literature, because the derivative of that penalty acts like a force pulling the weights towards zero. This weight penalty keeps the weights small unless they have big error derivatives to counteract it. So if you look at what the penalty term looks like as the weight moves away from zero, you get this parabolic cost.

If you look at the equation, the cost that you're optimizing is the normal error that you're trying to reduce, plus a term which is the sum of the squares of the weights, with a coefficient lambda in front. We divide by two so that when we differentiate, the two is cancelled. That coefficient in front of the sum of squared weights is sometimes called the weight cost, and it determines how strong the penalty is. If you differentiate, you can see that the derivative of this cost is just the derivative of the error plus something that depends on the size of the weight and the value of lambda.

That derivative will be zero when the magnitude of the weight is one over lambda times the magnitude of the error derivative. So the only way you can have big weights when you're at a minimum of the cost function is if they also have big error derivatives. This makes the weights much easier to interpret: you don't have a whole lot of weights that are large but aren't doing anything. The effect of an L2 penalty on the weights is to prevent the network from using weights that it doesn't need. This often improves generalization a lot, because it stops the network from using weights it doesn't really need just to fit the sampling error. It also makes a smoother model, in which the output changes more slowly as the input changes.
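As a concrete illustration of the cost and its derivative described above, here is a minimal sketch of one gradient step with an L2 (weight-decay) penalty. The function and variable names (l2_penalized_update, error_grad, lam, lr) are illustrative choices for this sketch, not anything from the lecture.

```python
import numpy as np

def l2_penalized_update(weights, error_grad, lam, lr):
    """One gradient step on C = E + (lam / 2) * sum(w ** 2).

    dC/dw = dE/dw + lam * w, so at a minimum of C each weight satisfies
    |w| = (1 / lam) * |dE/dw|: a weight can only stay big if its error
    derivative is also big.
    """
    cost_grad = error_grad + lam * weights   # derivative of the penalized cost
    return weights - lr * cost_grad          # the lam * w term decays each weight towards zero

# Tiny usage example with made-up numbers.
w = np.array([2.0, -0.5, 0.01])
g = np.array([0.1, 0.0, 0.3])                # stand-ins for dE/dw
print(l2_penalized_update(w, g, lam=0.01, lr=0.1))
```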
To see why it gives a smoother model: if the network has two very similar inputs, an L2 weight penalty prefers to put half the weight on each of those two similar inputs rather than all of the weight on one, as shown on the right here. If the two inputs are very similar, those two networks will produce very similar outputs, but the one with the half weights will have much less extreme changes in its outputs when you change the inputs.

There are other kinds of weight penalty. For example, an L1 penalty, where the cost function is just this V shape. So here what we're doing is penalizing the absolute values of the weights. This has the nice effect that it drives many of the weights to be exactly zero, and that helps a lot with interpretation: if there are only a few non-zero weights left, it's much easier to understand what's going on. We can also use weight penalties that are more extreme than L1, where the gradient of the cost function actually gets smaller when the weight gets really big. This allows the network to keep large weights without them being pulled towards zero; it's just the small weights that get pulled towards zero. So we're then even more likely to end up with just a few large weights.

Instead of putting penalties on the weights, we could actually use weight constraints. What I mean by that is, instead of penalizing the squared value of each weight separately, we put a constraint on the maximum squared length of the incoming weight vector of each hidden unit or output unit. When we update the weights, if the length of that incoming vector gets longer than allowed by the constraint, we simply scale the vector down by dividing all the weights by the same amount until its length fits the allowed length.

Using weight constraints like this has a number of advantages over weight penalties, and I've found that they work better. It's much easier to select a sensible value for the squared length of the incoming weight vector than it is to select the weight penalty. That's because logistic units have a natural scale to them, so we know what a weight of one means.

Using weight constraints also prevents hidden units getting stuck near zero with all their weights being tiny and not doing anything useful, because when all their weights are tiny there's no constraint on the weights, so there's nothing preventing them growing.
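Going back to the different penalty shapes described a moment ago, here is a small sketch (not from the lecture) comparing the gradient each penalty contributes per weight. The heavier-tailed penalty shown, lam * log(1 + w^2), is just one plausible choice picked to illustrate a gradient that shrinks for big weights.

```python
import numpy as np

# Illustrative per-weight penalty gradients; the names and the heavy-tailed
# penalty are assumptions made for this sketch.
def l2_grad(w, lam):
    return lam * w                       # pull proportional to the weight (parabolic cost)

def l1_grad(w, lam):
    return lam * np.sign(w)              # constant pull (V-shaped cost); drives small weights to exactly zero

def heavy_tail_grad(w, lam):
    # Example penalty lam * log(1 + w**2): its gradient shrinks as |w| grows,
    # so big weights are barely pulled at all while small ones still are.
    return lam * 2.0 * w / (1.0 + w ** 2)

w = np.linspace(-5.0, 5.0, 5)
for grad_fn in (l2_grad, l1_grad, heavy_tail_grad):
    print(grad_fn.__name__, np.round(grad_fn(w, lam=0.1), 3))
```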
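And here is a minimal sketch of the weight-constraint procedure itself: after each update, rescale any unit whose incoming weight vector has grown longer than allowed. The matrix layout (inputs as rows, hidden units as columns) and the name apply_max_norm are assumptions for this example.

```python
import numpy as np

def apply_max_norm(W, max_sq_length):
    """Constrain the squared length of each unit's incoming weight vector.

    W has one column per hidden (or output) unit. If a column's squared
    length exceeds max_sq_length, divide all of that column's weights by
    the same factor so its length exactly fits the constraint; columns
    within the limit are left untouched.
    """
    sq_lengths = (W ** 2).sum(axis=0)
    scale = np.sqrt(max_sq_length / np.maximum(sq_lengths, max_sq_length))
    return W * scale

# Usage: call this after every gradient update of W.
W = np.random.randn(4, 3) * 2.0
W = apply_max_norm(W, max_sq_length=1.0)
print((W ** 2).sum(axis=0))   # every column's squared length is now <= 1.0
```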
Weight constraints also prevent the weights from exploding.

One of the subtle things that weight constraints do is that when a unit hits its constraint, the effective penalty on all of its weights is determined by the big gradients. So if some of the incoming weights have very big gradients, they'll be trying to push the length of the incoming weight vector up, and that will push down on all the other weights. So in effect, if you think of it like a penalty, the penalty scales itself so as to be appropriate for the big weights and to suppress the small weights. This is much more effective than a fixed penalty at pushing irrelevant weights towards zero. For those of you who know about Lagrange multipliers, the penalties are just the Lagrange multipliers required to keep the constraints satisfied.
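To spell out that last remark, here is a sketch of the standard Lagrange-multiplier argument, with symbols chosen for this note rather than taken from the lecture. Minimizing the error E subject to the constraint that a unit's incoming weight vector satisfies a squared-length limit c gives the Lagrangian and stationarity condition

```latex
\mathcal{L}(\mathbf{w}, \lambda) = E(\mathbf{w}) + \frac{\lambda}{2}\left(\lVert \mathbf{w} \rVert^{2} - c\right),
\qquad
\frac{\partial \mathcal{L}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda\, w_i = 0 .
```

So at a unit that sits on its constraint, the condition looks exactly like an L2 weight cost with coefficient lambda, except that lambda is whatever value is needed to keep that particular unit's squared length equal to c: large when big gradients push the length up, and zero when the constraint is slack.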