Okay, so now let's discuss how to find the gradient with respect to the variational parameters phi.

Here is our objective, and we want to differentiate it. Again, let's rewrite the definition of the expected value as an integral of the probability times the logarithm. And again, we can move the gradient inside the summation, which changes nothing, and also inside the integral: if the functions are smooth and nice, we can swap the integration and differentiation signs.

However, in contrast to the case in the previous video, we cannot push the differentiation sign forward, next to the logarithm. First of all, the gradient of log p with respect to phi is zero, because p doesn't depend on phi. So the right-hand side of that expression would just be zero, which is obviously not what the left-hand side is. The reason we can't do it is that q itself depends on phi, so we have to take the gradient of q with respect to phi.

And if we do that, the problem is that we no longer have an expected value. If you look at the first equation on this slide, it is a sum of integrals of the gradient of q times the logarithm of p. This thing is not an expected value with respect to any distribution, so you can't approximate it with Monte Carlo: you can't sample from some distribution and then use the samples to approximate it, because there is no distribution here. There is just the gradient of a distribution, which is not a distribution, and the logarithm of a distribution, which is also not a distribution.

So how can we approximate this gradient with something? Well, one thing we can do is artificially add a distribution inside: we multiply and divide by q. Then we can treat this q as the probability we average over, and the gradient of q times log p, divided by q, as the function whose expected value we are computing. Or, simplifying this expression a little, the gradient of q divided by q is just the gradient of the logarithm of q, by the definition of the derivative of a logarithm.
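Written out explicitly (a reconstruction in notation suggested by the lecture, with q(t | x, φ) as the variational distribution over the latent t and p(x | t) as the model term), the manipulation described above is:

$$
\nabla_\phi \int q(t \mid x, \phi)\,\log p(x \mid t)\,dt
= \int \nabla_\phi q(t \mid x, \phi)\,\log p(x \mid t)\,dt
= \int q(t \mid x, \phi)\,\frac{\nabla_\phi q(t \mid x, \phi)}{q(t \mid x, \phi)}\,\log p(x \mid t)\,dt
= \int q(t \mid x, \phi)\,\nabla_\phi \log q(t \mid x, \phi)\,\log p(x \mid t)\,dt.
$$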
Then we can rewrite this formula as follows: it is an integral of q times the gradient of the logarithm of q, times the logarithm of p. This is still an exact formula; we didn't lose anything to any kind of approximation. And now we can say that this last expression is an expected value with respect to q: the expected value of the gradient of log q times log p. This is sometimes called the log-derivative trick, and it works for any distribution. It allows you to differentiate an expected value even when the gradient of this expected value is not an expected value by itself. So now we have an expected value again, and we can sample from q and approximate the gradient with Monte Carlo.

This is a valid approach, and until recently people used it, and it kind of worked. But here is the problem. The expected value itself is correct; it is an exact expression, and we didn't lose anything. But if we try to approximate it with Monte Carlo, we get a really loose approximation, because its variance is high, and we would have to sample lots and lots and lots of points to get a gradient approximation that is at least a little bit accurate.

The reason is the factor log p(x). When we start training, this p(x) is as low as possible, because p(x) is a distribution over natural images and has to assign some probability to every image. So at the start, when we don't know anything about our data, any image is really improbable according to our model, and the logarithm of that probability may be something like minus one million. The model hasn't yet gotten used to the training data, so it thinks the training images are really, really improbable.

This means we are finding the expected value of something times minus one million. And because the first term, the gradient of the logarithm of q, can be positive or negative, when we do Monte Carlo and average a few samples, we get something like minus 1,000,000, plus 900,000, minus 1,100,000, and so on. So the individual values are really high in absolute value, but of different signs, and on average they cancel to the true value, which may be around, I don't know, 100.
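As a minimal sketch of this variance problem (a toy one-dimensional example with made-up numbers, not the lecture's actual model), take q(t | φ) = N(φ, 1) and a stand-in for log p that carries a huge negative constant, like the minus one million above:

```python
import numpy as np

rng = np.random.default_rng(0)

phi = 0.0        # variational parameter: mean of q(t | phi) = N(phi, 1)
C = -1e6         # huge negative constant, mimicking log p(x) early in training
n = 10_000       # number of Monte Carlo samples

t = rng.normal(phi, 1.0, size=n)   # samples t ~ q(t | phi)
f = -(t - 3.0) ** 2 + C            # stand-in for log p: very negative everywhere
score = t - phi                    # grad_phi log q(t | phi) for a unit-variance Gaussian

# Per-sample log-derivative (score-function) estimates of the gradient.
estimates = score * f

print("true gradient :", -2.0 * (phi - 3.0))  # analytic value: 6
print("MC estimate   :", estimates.mean())
print("per-sample std:", estimates.std())     # ~1e6, dominated by the constant C
```

The true gradient here is 6, but each per-sample estimate is on the order of ±10⁶, so even with ten thousand samples the standard error of the Monte Carlo average is around 10⁴, completely swamping the quantity we are trying to estimate.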
That average is the exact value of the gradient in this example, but the variance is so high that you would have to use lots and lots of samples to approximate it accurately. And note that we didn't have this problem in the previous video, because instead of the logarithm of p we had the gradient of the logarithm of p, and even if log p is something like minus one million, its gradient will probably not be that large.

So this is a problem, and in the next video we'll talk about one nice solution to it in this particular case: how can we estimate this gradient with a small-variance estimator?
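For contrast, here are the two situations side by side, in the same assumed notation as above (with w being the model weights from the previous video):

$$
\nabla_w\, \mathbb{E}_{q(t \mid x, \phi)}\big[\log p(x \mid t, w)\big]
= \mathbb{E}_{q(t \mid x, \phi)}\big[\nabla_w \log p(x \mid t, w)\big],
$$

where the gradient moves all the way inside because q does not depend on w, versus

$$
\nabla_\phi\, \mathbb{E}_{q(t \mid x, \phi)}\big[\log p(x \mid t, w)\big]
= \mathbb{E}_{q(t \mid x, \phi)}\big[\nabla_\phi \log q(t \mid x, \phi)\,\log p(x \mid t, w)\big],
$$

where the integrand still carries the huge factor log p.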