[MUSIC] In the previous video, we completely defined our model. Now all that is left is to understand how to maximize it with respect to the weights of both neural networks, w and phi. So we have to maximize this kind of objective, and since it has an expected value inside, we have to approximate it with Monte Carlo somehow. So let's look closer into the subject.

First of all, the second part is easy, because it is just the KL divergence between a Gaussian with known parameters and the standard Gaussian. So although it has an integral inside, we can compute this term analytically, and it will not cause us any trouble, either for evaluating it or for finding gradients with respect to the parameters. We can just not think about it and let TensorFlow compute the gradients, as long as we define the divergence by its analytical formula.

So let's look a little closer at the first term of this expression. Let's call it f, a function of the parameters w and phi. This function is a sum over objects of expected values of a log-probability. Recall that we decided that each q_i for an individual object would be some distribution q(t_i | x_i, phi), defined by a convolutional neural network with parameters phi. So let's rewrite it this way and start by looking at the gradient of this function with respect to w.

The gradient with respect to w looks as follows: it is the gradient of a sum of expected values, and we can write each expected value by its definition. The latent variable t_i is continuous, so the expected value is just the integral of the density q(t_i | x_i, phi) times the function, the logarithm of p(x_i | t_i). Now we can move the gradient sign inside the summation, because summation and differentiation do not interfere with each other, so we can swap these signs. Also, for smooth and nice enough functions, we can usually swap the integration and the gradient signs as well. Finally, since the first factor, q(t_i | x_i, phi), does not depend on w, we can push the gradient sign even further inside: this q is just a constant with respect to w, and it does not affect the gradient, so we simply multiply the gradient of the logarithm by this value.

Now we can see that what we have obtained is just an expected value of the gradient: a sum over the objects in the data set of the expected value of the gradient of the logarithm. And we can approximate this expected value by sampling. For example, we can sample one point from the variational distribution q(t_i | x_i, phi), put it inside the logarithm of p(x_i | t_i), and compute its gradient with respect to w. So basically what we are doing here is passing our image through the first convolutional neural network to get the parameters of the variational distribution q(t_i | x_i, phi). Then we sample one point from this variational distribution, put this point as input to the second network with parameters w, and compute the usual gradient of this second neural network with respect to its parameters, given that its input is this sample t_i hat. So this is just the usual gradient, and we can use TensorFlow to find it automatically.
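Here is a minimal TensorFlow 2 sketch of this one-sample gradient estimate with respect to w. The names `encoder` and `decoder` are hypothetical Keras models standing in for the two convolutional networks (parameters phi and w): the encoder is assumed to return the mean and log-variance of the Gaussian q(t_i | x_i, phi), and the decoder is assumed to output Bernoulli logits for p(x_i | t_i, w); a different likelihood would change only the log-probability line.

```python
import tensorflow as tf

def decoder_gradient(encoder, decoder, x):
    """One-sample unbiased estimate of the gradient of E_q[log p(x | t, w)]
    with respect to the decoder weights w, for a batch of images x."""
    # Pass the images through the first network to get the parameters of q
    # (hypothetical convention: the encoder returns mean and log-variance).
    m, log_s2 = encoder(x)

    # Sample one point t_hat from the Gaussian q(t_i | x_i, phi).
    eps = tf.random.normal(tf.shape(m))
    t_hat = m + tf.exp(0.5 * log_s2) * eps

    with tf.GradientTape() as tape:
        # Treat t_hat as an ordinary input (a training object) for the decoder.
        logits = decoder(t_hat)
        # log p(x_i | t_i, w), assuming a Bernoulli likelihood on the pixels.
        log_p = -tf.reduce_sum(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=logits),
            axis=list(range(1, x.shape.rank)))
        loss = -tf.reduce_mean(log_p)  # negative log-likelihood to minimize

    # The usual gradient of the second network with respect to its parameters w.
    return tape.gradient(loss, decoder.trainable_variables)
```

Because the encoder pass and the sampling happen outside the gradient tape, only the decoder parameters w receive a gradient here, which is exactly the estimate described above.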
Finally, this gradient still depends on the whole data set, but we can easily approximate it with a mini-batch: we write it as a normalizing constant times a sum over the mini-batch of random objects chosen for this particular iteration. And this is the standard stochastic gradient for a neural network. So you do not have to think too much here; you just find the gradient of the second part of your neural network with respect to its parameters with TensorFlow.

So the overall pipeline here is as follows. We take our objective and pass the input image through the first convolutional neural network with parameters phi. We find the parameters m and s of the variational distribution q. We sample one point from this Gaussian with parameters m and s. We put this point t_i hat inside the second convolutional neural network with parameters w, treating t_i hat as input data, as a training object for that second network. Then we compute the objective of this second CNN and just use TensorFlow to differentiate it with respect to the parameters.

Note that here we always used unbiased estimation of the expected values: we always substituted expected values with sample averages, and not with some more complicated expressions whose unbiasedness is unclear. So everything here is unbiased, and on average this stochastic approximation of the gradient will be correct. If you do enough iterations, you will converge to some good point in your parameter space. [MUSIC]
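For reference, here is a sketch of one full mini-batch step of this procedure, under the same assumptions as the previous snippet (hypothetical `encoder` and `decoder` Keras models, Bernoulli decoder, Gaussian q with mean m and log-variance log_s2). The KL term between N(m, s^2) and N(0, I) is written in its standard closed form, so TensorFlow could differentiate it analytically; the expected log-likelihood is replaced by its unbiased one-sample estimate, and only the decoder weights w are updated here, since the gradient with respect to phi is treated separately.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-3)  # hypothetical choice of optimizer

@tf.function
def train_step_w(encoder, decoder, x):
    """One stochastic-gradient step for the decoder weights w on a mini-batch x."""
    # First network: parameters m, s of the variational distribution q.
    m, log_s2 = encoder(x)
    # One sample t_hat from the Gaussian with parameters m and s.
    eps = tf.random.normal(tf.shape(m))
    t_hat = m + tf.exp(0.5 * log_s2) * eps

    with tf.GradientTape() as tape:
        # Second network: treat t_hat as a training object and score x under it.
        logits = decoder(t_hat)
        log_p = -tf.reduce_sum(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=logits),
            axis=list(range(1, x.shape.rank)))

        # Analytic KL( N(m, s^2) || N(0, I) ), summed over latent dimensions.
        # It does not depend on w, so it only shifts the objective value
        # in this particular step.
        kl = 0.5 * tf.reduce_sum(
            tf.exp(log_s2) + tf.square(m) - 1.0 - log_s2, axis=1)

        loss = tf.reduce_mean(kl - log_p)  # negative lower bound, batch average

    grads = tape.gradient(loss, decoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, decoder.trainable_variables))
    return loss
```

Repeating this step over enough random mini-batches gives the stochastic optimization described in the lecture; since each gradient estimate is unbiased, the updates are correct on average.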