In the previous video, we completely defined our model. Now everything that is left is to understand how to maximize it with respect to the weights of both neural networks, w and phi. So we have to maximize this kind of objective, and since it has an expected value inside, we have to approximate it with Monte Carlo somehow. So let's look closer at this objective.

First of all, one part is easy, because it is just the KL divergence between a Gaussian with known parameters and the standard Gaussian. So although it has an integral inside, we can compute this term analytically, and this expression will not cause us any trouble, either in evaluating it or in finding its gradients with respect to the parameters. We can simply write the KL divergence as its analytical formula and let TensorFlow take care of the gradients.

So let's look a little closer at the first term of this expression, which we call f of the parameters w and phi. This function is a sum over objects of expected values of the logarithm of a probability. And recall that we decided that each q_i for an individual object would be some distribution q of t_i given x_i and phi, which is defined by a convolutional neural network with parameters phi. So let's rewrite it this way, and let's start by looking at the gradient of this function with respect to w.

The gradient of this function with respect to w looks as follows: we have the gradient of a sum of expected values. We write the expected value out by definition. The latent variable t_i is continuous, and thus the expected value is just the integral of the probability times the function, the logarithm of p of x_i given t_i. Now we can move the gradient sign inside the summation, because summation and taking the gradient do not interfere with each other, so we can swap these signs. And also, for smooth and nice functions, we can usually swap the integration and gradient signs as well. Finally, since the first factor, q of t_i given x_i and phi, does not depend on w, we can push the gradient sign even further inside, because this q is just a constant with respect to w.
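Written out, the chain of manipulations just described looks roughly like this (my reconstruction from the narration; the exact notation on the slide may differ):

```latex
\nabla_w f(w, \phi)
  = \nabla_w \sum_i \int q(t_i \mid x_i, \phi)\, \log p(x_i \mid t_i, w)\, dt_i
  = \sum_i \int \nabla_w \bigl[\, q(t_i \mid x_i, \phi)\, \log p(x_i \mid t_i, w) \,\bigr]\, dt_i
  = \sum_i \int q(t_i \mid x_i, \phi)\, \nabla_w \log p(x_i \mid t_i, w)\, dt_i
  = \sum_i \mathbb{E}_{q(t_i \mid x_i, \phi)} \bigl[ \nabla_w \log p(x_i \mid t_i, w) \bigr]
```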
And since q doesn't affect the value of the gradient itself, we just have to multiply the gradient of the logarithm by this value. And now we can see that what we obtained is just an expected value of the gradient: a sum over the objects in the data set of the expected value of the gradient of the logarithm. And we can approximate this expected value by sampling. So we can sample one point, for example, from the variational distribution q of t_i, then put it inside the logarithm of p of x_i given t_i, and compute its gradient with respect to w.

So basically what we're doing here is passing our image through our first convolutional neural network to get the parameters of the variational distribution q of t_i. Then we sample one point from the variational distribution, and we feed this point as input to the second neural network with parameters w. Then we just compute the usual gradient of this second neural network with respect to its parameters, given that its input is this sample t_i hat. So this is just the usual gradient, and we can use TensorFlow to find it automatically.

And finally, this expression depends on the whole data set, but we can easily approximate it with a mini-batch. We can write it as some constant that normalizes things, times a sum over a mini-batch of random objects which we have chosen for this particular iteration. And this is the standard stochastic gradient for a neural network. So you don't have to think too much here: you just have to use TensorFlow to find the gradient of the second part of your neural network with respect to its parameters.

So the overall scheme here is as follows. We have our function, our objective. We pass our input image through the first convolutional neural network with parameters phi. We find the parameters m and s of the variational distribution q. We sample one point from this Gaussian with parameters m and s. We put this point t_i hat into the second convolutional neural network with parameters w, and we treat this t_i hat as input data, as a training object, for the second convolutional neural network. And then we compute the objective of this second CNN and just use TensorFlow to differentiate it with respect to the parameters.
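To make this concrete, here is a minimal TensorFlow sketch of the two pieces discussed so far: the analytically computed KL term and the single-sample estimate of the gradient with respect to w. The architectures, the Bernoulli likelihood, and names such as `encoder`, `decoder`, `grad_wrt_w`, and `latent_dim` are my own illustrative assumptions, not taken from the lecture, and only the gradient with respect to w is shown, since that is the part derived in this video.

```python
import tensorflow as tf

latent_dim = 8

encoder = tf.keras.Sequential([                      # first network, parameters phi
    tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2 * latent_dim),           # m and log(s), concatenated
])

decoder = tf.keras.Sequential([                      # second network, parameters w
    tf.keras.layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(28 * 28, activation="sigmoid"),
    tf.keras.layers.Reshape((28, 28, 1)),
])


def kl_to_standard_normal(m, log_s):
    """Analytic KL(N(m, diag(s^2)) || N(0, I)) per object -- the term that
    needs no sampling at all."""
    return 0.5 * tf.reduce_sum(tf.exp(2.0 * log_s) + tf.square(m)
                               - 1.0 - 2.0 * log_s, axis=1)


def grad_wrt_w(x):
    """Single-sample Monte Carlo estimate of the gradient of
    sum_i E_q[log p(x_i | t_i, w)] with respect to the decoder weights w."""
    m, log_s = tf.split(encoder(x), 2, axis=1)
    # Sample one point t_i_hat from q(t_i | x_i, phi) for every object; it is
    # then treated as fixed input data for the second network.
    t_hat = m + tf.exp(log_s) * tf.random.normal(tf.shape(m))
    with tf.GradientTape() as tape:
        x_rec = decoder(t_hat)
        # log p(x | t_hat, w) under the assumed Bernoulli decoder.
        log_p = tf.reduce_sum(
            x * tf.math.log(x_rec + 1e-8)
            + (1.0 - x) * tf.math.log(1.0 - x_rec + 1e-8))
    return tape.gradient(log_p, decoder.trainable_variables)


# Usage on a random mini-batch of "images":
grads = grad_wrt_w(tf.random.uniform((32, 28, 28, 1)))
```

These gradients can then be passed to any standard optimizer, exactly as for an ordinary neural network.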
Note that here we always used unbiased estimates of the expected values: we always substitute expected values with sample averages, and not with more complicated expressions for which it would be unclear whether they are unbiased or not. So here everything is unbiased, and on average this stochastic approximation of the gradient will be correct. And if you do enough iterations, you will converge to some good point in your parameter space.
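As a tiny numerical illustration of this point (not from the lecture), a single-sample estimate of an expectation is noisy, but it is unbiased, so averaging many such one-sample estimates recovers the true expected value:

```python
import tensorflow as tf

# One-sample estimates of E_{t ~ N(0, 1)}[t^2]; each one is noisy, but their
# average converges to the true value 1, so the estimator is unbiased.
tf.random.set_seed(0)
one_sample_estimates = tf.square(tf.random.normal([100000]))
print(float(tf.reduce_mean(one_sample_estimates)))   # approximately 1.0
```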