[MUSIC] In the previous video, we completely defined our model. Now all that is left is to understand how to maximize it with respect to the weights of both neural networks, w and phi. So we have to maximize this kind of objective, and since it has an expected value inside, we have to approximate it with Monte Carlo somehow. So let's look closer into the subject.

First of all, the second part is easy, because it is just the KL divergence between a Gaussian with known parameters and the standard Gaussian. So although it has an integral inside, we can compute this term analytically, and it will not cause us any trouble, either for evaluating it or for finding gradients with respect to the parameters. We can just not think about it and let TensorFlow compute the gradients, as long as we define the divergence by its analytical formula.

So let's look a little closer at the first term of this expression. Let's call it f, a function of the parameters w and phi. This function is a sum over objects of expected values of a log-probability. Recall that we decided that each q_i for an individual object would be some distribution q(t_i | x_i, phi), defined by a convolutional neural network with parameters phi. So let's rewrite it this way and start by looking at the gradient of this function with respect to w.

The gradient with respect to w looks as follows: it is the gradient of a sum of expected values, and we can write each expected value by its definition. The latent variable t_i is continuous, so the expected value is just the integral of the density q(t_i | x_i, phi) times the function, the logarithm of p(x_i | t_i). Now we can move the gradient sign inside the summation, because summation and differentiation do not interfere with each other, so we can swap these signs. Also, for smooth and nice enough functions, we can usually swap the integration and the gradient signs as well. Finally, since the first factor, q(t_i | x_i, phi), does not depend on w, we can push the gradient sign even further inside: this q is just a constant with respect to w, and it does not affect the gradient, so we simply multiply the gradient of the logarithm by this value.

Now we can see that what we have obtained is just an expected value of the gradient: a sum over the objects in the data set of the expected value of the gradient of the logarithm. And we can approximate this expected value by sampling. For example, we can sample one point from the variational distribution q(t_i | x_i, phi), put it inside the logarithm of p(x_i | t_i), and compute its gradient with respect to w. So basically what we are doing here is passing our image through the first convolutional neural network to get the parameters of the variational distribution q(t_i | x_i, phi). Then we sample one point from this variational distribution, put this point as input to the second network with parameters w, and compute the usual gradient of this second neural network with respect to its parameters, given that its input is this sample t_i hat. So this is just the usual gradient, and we can use TensorFlow to find it automatically.
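Here is a minimal TensorFlow 2 sketch of this one-sample gradient estimate with respect to w. The names `encoder` and `decoder` are hypothetical Keras models standing in for the two convolutional networks (parameters phi and w): the encoder is assumed to return the mean and log-variance of the Gaussian q(t_i | x_i, phi), and the decoder is assumed to output Bernoulli logits for p(x_i | t_i, w); a different likelihood would change only the log-probability line.

```python
import tensorflow as tf

def decoder_gradient(encoder, decoder, x):
    """One-sample unbiased estimate of the gradient of E_q[log p(x | t, w)]
    with respect to the decoder weights w, for a batch of images x."""
    # Pass the images through the first network to get the parameters of q
    # (hypothetical convention: the encoder returns mean and log-variance).
    m, log_s2 = encoder(x)

    # Sample one point t_hat from the Gaussian q(t_i | x_i, phi).
    eps = tf.random.normal(tf.shape(m))
    t_hat = m + tf.exp(0.5 * log_s2) * eps

    with tf.GradientTape() as tape:
        # Treat t_hat as an ordinary input (a training object) for the decoder.
        logits = decoder(t_hat)
        # log p(x_i | t_i, w), assuming a Bernoulli likelihood on the pixels.
        log_p = -tf.reduce_sum(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=logits),
            axis=list(range(1, x.shape.rank)))
        loss = -tf.reduce_mean(log_p)  # negative log-likelihood to minimize

    # The usual gradient of the second network with respect to its parameters w.
    return tape.gradient(loss, decoder.trainable_variables)
```

Because the encoder pass and the sampling happen outside the gradient tape, only the decoder parameters w receive a gradient here, which is exactly the estimate described above.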
Finally, this gradient still depends on the whole data set, but we can easily approximate it with a mini-batch: we write it as a normalizing constant times a sum over the mini-batch of random objects chosen for this particular iteration. And this is the standard stochastic gradient for a neural network. So you do not have to think too much here; you just find the gradient of the second part of your neural network with respect to its parameters with TensorFlow.

So the overall pipeline here is as follows. We take our objective and pass the input image through the first convolutional neural network with parameters phi. We find the parameters m and s of the variational distribution q. We sample one point from this Gaussian with parameters m and s. We put this point t_i hat inside the second convolutional neural network with parameters w, treating t_i hat as input data, as a training object for that second network. Then we compute the objective of this second CNN and just use TensorFlow to differentiate it with respect to the parameters.

Note that here we always used unbiased estimation of the expected values: we always substituted expected values with sample averages, and not with some more complicated expressions whose unbiasedness is unclear. So everything here is unbiased, and on average this stochastic approximation of the gradient will be correct. If you do enough iterations, you will converge to some good point in your parameter space. [MUSIC]
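For reference, here is a sketch of one full mini-batch step of this procedure, under the same assumptions as the previous snippet (hypothetical `encoder` and `decoder` Keras models, Bernoulli decoder, Gaussian q with mean m and log-variance log_s2). The KL term between N(m, s^2) and N(0, I) is written in its standard closed form, so TensorFlow could differentiate it analytically; the expected log-likelihood is replaced by its unbiased one-sample estimate, and only the decoder weights w are updated here, since the gradient with respect to phi is treated separately.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-3)  # hypothetical choice of optimizer

@tf.function
def train_step_w(encoder, decoder, x):
    """One stochastic-gradient step for the decoder weights w on a mini-batch x."""
    # First network: parameters m, s of the variational distribution q.
    m, log_s2 = encoder(x)
    # One sample t_hat from the Gaussian with parameters m and s.
    eps = tf.random.normal(tf.shape(m))
    t_hat = m + tf.exp(0.5 * log_s2) * eps

    with tf.GradientTape() as tape:
        # Second network: treat t_hat as a training object and score x under it.
        logits = decoder(t_hat)
        log_p = -tf.reduce_sum(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=logits),
            axis=list(range(1, x.shape.rank)))

        # Analytic KL( N(m, s^2) || N(0, I) ), summed over latent dimensions.
        # It does not depend on w, so it only shifts the objective value
        # in this particular step.
        kl = 0.5 * tf.reduce_sum(
            tf.exp(log_s2) + tf.square(m) - 1.0 - log_s2, axis=1)

        loss = tf.reduce_mean(kl - log_p)  # negative lower bound, batch average

    grads = tape.gradient(loss, decoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, decoder.trainable_variables))
    return loss
```

Repeating this step over enough random mini-batches gives the stochastic optimization described in the lecture; since each gradient estimate is unbiased, the updates are correct on average.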