In this video, I'm going to talk about the Bayesian interpretation of weight penalties. In the full Bayesian approach, we try to compute the posterior probability of every possible setting of the parameters of a model. But there's a much reduced form of the Bayesian approach, where we simply say: I'm going to look for the single set of parameters that is the best compromise between fitting my prior beliefs about what the parameters should be like, and fitting the data I've observed. This is called Maximum a Posteriori learning, and it gives us a nice explanation of what's really going on when we use weight decay to control the capacity of a model.

I'm now going to talk a bit about what's really going on when we minimize the squared error during supervised maximum likelihood learning. Finding the weight vector that minimizes the squared residuals, that is, the differences between the target values and the values predicted by the net, is equivalent to finding a weight vector that maximizes the log probability density of the correct answer. In order to see this equivalence, we have to assume that the correct answer is produced by adding Gaussian noise to the output of the neural net. So the idea is, we make a prediction by first running the neural net on the input to get the output, and then adding some Gaussian noise. And then we ask: what's the probability that when we do that, we get the correct answer?

So the model's output is the center of a Gaussian, and what we're interested in is having the target value get high probability density under that Gaussian, because the probability of producing the value t, given that the network gives an output of y, is just the probability density of t under a Gaussian centered at y.

So the math looks like this. Let's suppose that the output of the neural net on training case c is y_c, and this output is produced by applying the weights W to the input for case c. The probability that we'll get the correct target value when we add Gaussian noise to that output y_c is given by a Gaussian centered on y_c. So we're interested in the probability density of the target value under a Gaussian centered at the output of the neural net. And on the right here, we have that Gaussian distribution, with mean y_c. We also have to assume some variance, and that variance will be important later.
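As a sketch of the density just described, written out in LaTeX (here σ² stands for the assumed noise variance, which the lecture has not yet named at this point):

```latex
% Density of the target t_c under Gaussian noise added to the net's output y_c,
% with assumed noise variance \sigma^2:
p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi\sigma^2}}
                  \exp\!\left(-\frac{(t_c - y_c)^2}{2\sigma^2}\right)
```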
If we now take logs and put in a minus sign, we see that the negative log probability density of the target value t_c, given that the network outputs y_c, is a constant that comes from the normalizing term of the Gaussian, plus the log of that exponential with the minus sign in it, which is just (t_c - y_c)^2 divided by twice the variance of the Gaussian. So what you see is that if our cost function is the negative log probability of getting the right answer, that turns into minimizing a squared distance. It's helpful to know that whenever you see a squared error being minimized, you can make a probabilistic interpretation of what's going on, and in that probabilistic interpretation you'll be maximizing the log probability under a Gaussian.

So the proper Bayesian approach is to find the full posterior distribution over all possible weight vectors. If there's more than a handful of weights, that's hopelessly difficult when you have a non-linear net. Bayesians have a lot of ways of approximating this distribution, often using Monte Carlo methods. But for the time being, let's try and do something simpler. Let's just try to find the most probable weight vector, the single setting of the weights that's most probable given the prior knowledge we have and given the data. So what we're going to try and do is find an optimal value of W by starting with some random weight vector, and then adjusting it in the direction that improves the probability of that weight vector given the data. It will only be a local optimum.

Now, it's going to be easier to work in the log domain than in the probability domain. So if we want to minimize a cost, we'd better use negative log probabilities.

Just an aside about why we maximize sums of log probabilities, or minimize sums of negative log probabilities. What we really want to do is maximize the probability of the data, which means maximizing the product of the probabilities of producing all the target values that we observed on all the different training cases. If we assume that the output errors on different cases are independent, we can write that down as the product over all training cases of the probability of producing the target value t_c, given the weights. That is, the product of the probabilities of producing t_c given the output that we're going to get from our network if we give it the input for case c and it has weights W.
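A sketch of those two steps in LaTeX (same assumed notation as above; writing D for the whole set of observed targets is my notation, not the lecture's):

```latex
% Negative log density of one target under the Gaussian noise model:
-\log p(t_c \mid y_c) = \tfrac{1}{2}\log\!\left(2\pi\sigma^2\right)
                        + \frac{(t_c - y_c)^2}{2\sigma^2}

% Assuming independent errors across training cases, the likelihood factorizes:
p(D \mid W) = \prod_c p\big(t_c \mid y_c(W)\big)
```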
The log function is monotonic, so it can't change where the maxima are. So instead of maximizing a product of probabilities, we can maximize a sum of log probabilities, and that typically works much better on a computer; it's much more stable. So we maximize the log probability of the data given the weights, which is simply maximizing the sum over all training cases of the log probability of the output for that training case, given the input and given the weights.

In Maximum a Posteriori learning, we're trying to find the set of weights that optimizes the trade-off between fitting our prior and fitting the data. So that's Bayes' theorem. If we take negative logs to get a cost, we get that the negative log of the probability of the weights given the data is the negative log of the prior term, plus the negative log of the data term, plus an extra term. That last extra term is an integral over all possible weight vectors, so it doesn't depend on W, and we can ignore it when we're optimizing W. The term that depends on the data is the negative log probability of the data given W, and that's our normal error term. And the term that only depends on W is the negative log probability of W under its prior.

Maximizing the log probability of a weight is related to minimizing a squared distance, in just the same way that maximizing the log probability of producing the correct target value is related to minimizing a squared distance. So minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior. So here's a Gaussian. It's got a mean of zero, and we want to maximize the probability of the weights, or the log probability of the weights, and to do that we obviously want w to be close to the mean of zero. The equation for the Gaussian is just like this, where the mean is zero so we don't have to put it in. And the negative log probability of w is then the squared weight scaled by twice the variance, plus a constant that comes from the normalizing term of the Gaussian and isn't affected when we change w.

So finally we can get to the Bayesian interpretation of weight decay, or weight penalties.
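A hedged sketch of the equations being described here, with σ_W² standing in for the variance of the Gaussian weight prior (the symbol is my choice; the lecture just says "the variance of the prior"):

```latex
% Bayes' theorem written as costs (negative logs); the last term is constant in W:
-\log p(W \mid D) = -\log p(D \mid W) \;-\; \log p(W) \;+\; \log p(D)

% Zero-mean Gaussian prior on a single weight w, with variance \sigma_W^2:
p(w) = \frac{1}{\sqrt{2\pi\sigma_W^2}}\exp\!\left(-\frac{w^2}{2\sigma_W^2}\right),
\qquad
-\log p(w) = \frac{w^2}{2\sigma_W^2} + \text{const}
```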
We're trying to minimize the negative log probability of the weights given the data, and that involves minimizing the sum of two terms: a term that depends on both the weights and the data, namely how well the net's outputs fit the targets, and a term that depends only on the weights.

The first term is derived from the log probability of the data given the weights. If we assume Gaussian noise is added to the output of the model to make the prediction, then the negative log probability of a target is the squared distance between the output of the net and the target value, scaled by twice the variance of that Gaussian noise. Similarly, if we assume we have a Gaussian prior for the weights, the negative log probability of a weight under the prior is the squared value of that weight, scaled by twice the variance of the Gaussian prior.

So now let's take that equation and multiply through by 2σ_D², twice the variance of the noise. We get a new cost function. The first term, when we multiply through, turns into simply the sum over all training cases of the squared difference between the output of the net and the target. That's the squared error we typically minimize in a neural net. The second term becomes the ratio of the two variances times the sum of the squares of the weights. And so what you see is that the ratio of those two variances is exactly the weight penalty.

So we initially thought of weight penalties as just a number you make up to try and make things work better, where you fix the value of the weight penalty by using a validation set. But now we see that if we make this Gaussian interpretation, where we have a Gaussian prior and a Gaussian model of the relation of the output of the net to the target, then the weight penalty is determined by the variances of those Gaussians. It's just the ratio of those variances. It's not an arbitrary thing at all within this framework.
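A sketch of the final cost just described (σ_D² is the variance of the Gaussian noise around the net's output, σ_W² the variance of the Gaussian weight prior, both notational assumptions on my part):

```latex
% Multiplying the MAP cost through by 2\sigma_D^2 gives squared error plus a weight
% penalty whose coefficient is the ratio of the two variances:
E = \sum_c \big(y_c - t_c\big)^2 \;+\; \frac{\sigma_D^2}{\sigma_W^2}\sum_i w_i^2
```

Under this reading, the weight-decay coefficient that is usually tuned on a validation set corresponds to the ratio σ_D²/σ_W².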