So let's see how we can improve the idea of variational inference so that it becomes applicable to our latent variable model. Recall that the idea of variational inference is to maximize a lower bound on the thing we actually want to maximize, subject to the constraint that the variational distribution Q for each object is factorized, that is, a product of one-dimensional distributions. And let's emphasize that each object has its own individual variational distribution Q_i, and these distributions are not connected in any way.

One idea we can use here is as follows. If saying that the variational distribution Q for each object is factorized is not enough, let's approximate it even further and say that it is a Gaussian. So not only factorized, but a factorized Gaussian. This way everything should be easier, right? So every object has its own latent variable T_i, and this latent variable T_i will have a variational distribution Q_i, which is a Gaussian with some parameters M_i and S_i. These are parameters of our model which we want to train, and we will maximize our lower bound with respect to them.

It's a nice idea, but the problem is that we have just added a lot of parameters for each training object. For example, if your latent variable T_i is 50-dimensional, so it's a vector of 50 numbers, then you have just added 50 numbers for the vector M_i and 50 numbers for the vector S_i for each object. That's 100 parameters per training object. And if you have a million training objects, it's not a very good idea to add something like 100 million parameters to your model just because of some approximation, right? It will probably overfit, and it will probably be really hard to train because of this really high number of parameters. It is also not obvious how to find these parameters M and S for new objects when we want to do inference, predictions, or generation, because for a new object you have to solve some optimization problem again to find them, and that can be slow.

Okay, so we said that approximating the variational distribution with a factorized one is not enough. Approximating the factors of the variational distribution with Gaussians is nice, but we get too many parameters per object, because these Gaussians are not connected to each other; they have separate parameters. So let's try to connect the variational distributions Q_i of the individual objects. One way to do that is to say that they are all the same, so all Q_i's are equal to each other. We can do that, but it would be too restrictive; we would not be able to train anything meaningful. Another approach is to say that all Q_i's have the same form, but each depends on X_i and on some shared weights. So let's say that each Q_i is a normal distribution whose parameters somehow depend on X_i. It turns out that now each Q_i is different, but they all share the same parameterization, the same form. And now, even for a new object, we can easily find its variational approximation Q: we pass the new object through the function M and through the function S, and obtain the parameters of its Gaussian. This way, we now need to maximize our lower bound with respect to our original parameters W and this parameter Phi, which defines the parametric way we convert X_i into the parameters of the distribution. So how can we define this function M of X_i with parameters Phi?
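To make the amortization idea concrete, here is a minimal sketch in PyTorch. The class name, layer sizes, and dimensions are illustrative assumptions, not the lecture's exact network: the point is that one set of shared weights Phi maps any X_i to the parameters M(X_i) and S(X_i) of its Gaussian Q_i, instead of storing separate M_i and S_i for every training object.

```python
import torch
import torch.nn as nn

# Minimal sketch (hypothetical names and sizes): an "amortized" variational
# distribution q_i = N(m(x_i), s(x_i)^2), where m and s are computed by one
# network with shared parameters phi rather than stored per object.
class AmortizedQ(nn.Module):
    def __init__(self, x_dim=784, latent_dim=50):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mean = nn.Linear(256, latent_dim)      # m(x_i)
        self.log_var = nn.Linear(256, latent_dim)   # log s(x_i)^2

    def forward(self, x):
        h = self.hidden(x)
        return self.mean(h), self.log_var(h)        # parameters of q_i
```

With this parameterization, a new object needs only a single forward pass to get the parameters of its Q, with no per-object optimization.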
Well, as we have already discussed, convolutional neural networks are a really powerful tool for working with images, right? So let's use them here too. Now we will have a convolutional neural network with parameters Phi that looks at your original input image, for example of a cat, and transforms it into the parameters of your variational distribution. This way we have defined how to approximate the variational distribution Q in this form, right?

Okay, so let's look closer at the objective we are trying to maximize. Recall that the lower bound is, by definition, equal to the sum over the objects in the data set of the expected value of some logarithm with respect to the variational distribution Q_i, right? And recall that in the plain expectation maximization algorithm it was really hard to approximate this expected value by sampling, because the Q in this expected value used to be the true posterior distribution of the latent variable T_i. This true posterior is complicated, and we know it only up to a normalization constant, so we would have to use Markov chain Monte Carlo to sample from it, which is slow. But now we approximate Q with a Gaussian whose parameters we know how to obtain: for any object, we can pass it through our convolutional neural network with parameters Phi, obtain the parameters M and S, and then easily sample from this Gaussian, from this Q, to approximate our expected value. So now this intractable expected value can easily be approximated with sampling, because sampling is now cheap; it's just sampling from Gaussians. And if we recall how the model defines P of X_i given T, it's actually defined by another convolutional neural network.

So the overall workflow is as follows. We start with a training image X and pass it through the first neural network with parameters Phi. We get the parameters M and S of the variational distribution Q_i. We sample one point from this distribution, which is something random; it can be different depending on our random seed. And then we pass this just-sampled vector of the latent variable T_i into the second part of our neural network, the convolutional neural network with parameters W. This CNN, this second part, outputs a distribution on images, and we will actually try to make this whole structure return images that are as close to the input images as possible.

So this thing looks really close to something called autoencoders in neural networks, which are just neural networks trying to output something as close as possible to their input. And this model is called a variational autoencoder, because in contrast to the usual autoencoder, it has some sampling inside and some variational approximations. The first part of this network is called the encoder, because it encodes the image into a latent code, or rather into a distribution over latent codes. And the second part is called the decoder, because it decodes the latent code into an image.

Let's look at what would happen if we forgot about the variance in the variational distribution Q. So let's say that we set S to always be zero, okay? So for any X, S of X is 0. Then the variational distribution Q_i is actually deterministic: it always outputs the mean value M of X_i. And in this case, we are directly passing this M of X into the second part of the network, into the decoder.
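Here is a sketch of that encoder-sample-decoder workflow, again in PyTorch. The architecture (28x28 grayscale inputs, two convolutional layers, a fully connected decoder) is an assumption made for concreteness, not the lecture's exact model; what matters is the flow: encoder with parameters Phi produces M and S, we draw one sample T_i from the Gaussian, and the decoder with parameters W turns that sample back into an image.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):                       # parameters "phi"
    def __init__(self, latent_dim=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
        )
        self.mean = nn.Linear(32 * 7 * 7, latent_dim)      # m(x)
        self.log_var = nn.Linear(32 * 7 * 7, latent_dim)   # log s(x)^2

    def forward(self, x):
        h = self.conv(x)
        return self.mean(h), self.log_var(h)

class Decoder(nn.Module):                           # parameters "w"
    def __init__(self, latent_dim=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),  # mean mu(t) of p(x | t)
        )

    def forward(self, t):
        return self.net(t).view(-1, 1, 28, 28)

def forward_pass(encoder, decoder, x):
    m, log_var = encoder(x)                         # q_i = N(m, s^2)
    s = torch.exp(0.5 * log_var)
    t = m + s * torch.randn_like(s)                 # one sample t_i ~ q_i
    return decoder(t), m, log_var                   # reconstruction mu(t_i)
```

A training step would call forward_pass on a batch and then maximize the lower bound discussed next.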
So this way we obtain the usual autoencoder, with no stochastic elements inside. This variance in the variational distribution Q is exactly what makes this model different from the usual autoencoder.

Okay, so let's look a little bit closer at the objective we're trying to maximize. This variational lower bound can be decomposed into a sum of two terms, because the logarithm of a product is the sum of logarithms, right? The second term in this equation equals minus the Kullback-Leibler divergence between the variational distribution Q and the prior distribution P of T_i, just by definition. KL divergence is something we discussed in weeks two and three, and it measures some kind of difference between distributions. So when we maximize this minus KL, we are actually trying to minimize the KL, so we are trying to push the variational distribution Q_i as close to the prior as possible. And the prior is just the standard normal, as we decided, okay?

That is the second term, and the first term can be interpreted as follows. If for simplicity we set all the output variances to 1, then this log-likelihood of X_i given T_i is just minus the squared Euclidean distance between X_i and the predicted mu of T_i, up to a constant. So this term is actually a reconstruction loss: it tries to push X_i as close to the reconstruction as possible, and mu of T_i is just the mean output of our neural network. So if we consider our whole variational autoencoder, it takes an image X_i as input and outputs mu of T_i plus some noise. And if the noise is constant, then when training this model we are just trying to make X_i as close to mu of T_i as possible, which is basically the objective of the usual autoencoder. Note that we are also computing the expected value of this reconstruction loss with respect to Q_i, and Q_i is trying to approximate the posterior distribution of the latent variables. So we are saying that for the latent variables T_i that are likely to have caused X_i, according to our approximation Q_i, we want the reconstruction loss to be low. For these particular sensible T_i's, for this particular X_i, we want the reconstruction to be accurate. And this is, well, not the same, but really close to the usual autoencoder.

But the second part is what makes the difference. This Kullback-Leibler divergence is something that pushes Q_i to be non-deterministic, to be stochastic. Recall the idea that if we set the variance of Q_i to zero, we get the usual autoencoder, right? But why, while training the model, will it not choose to do that? After all, if you reduce the amount of noise inside, it should be easier to train, so why would it choose to keep injecting noise into itself? Well, because of this regularization. The KL divergence will not allow Q_i to become deterministic, because if the variance of Q_i is zero, then this KL term is just infinity, and we will not choose that kind of point in the parameter space. This regularization forces the overall structure to have some noise inside.

Also notice that because of this KL divergence, because we are forcing our Q_i to be close to the standard Gaussian, we can now detect outliers. If we have a usual image from the training data set, or something close to the training data set, then if we pass this image through our encoder, it will output a distribution Q_i which is close to the standard Gaussian, because it was trained that way.
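The two terms of the lower bound can be written down directly, assuming the encoder and decoder sketches above. This is a sketch of the (negated) objective: the squared-error reconstruction term, which is a unit-variance Gaussian log-likelihood up to a constant, plus the closed-form KL divergence between N(m, s^2) and the standard normal prior.

```python
import torch

# Negative lower bound for one batch, assuming forward_pass(...) above returned
# (reconstruction, m, log_var). Minimizing this maximizes the variational lower bound.
def negative_elbo(x, reconstruction, m, log_var):
    # reconstruction term: squared Euclidean distance between x_i and mu(t_i)
    reconstruction_loss = ((x - reconstruction) ** 2).sum(dim=[1, 2, 3])
    # KL( N(m, s^2) || N(0, I) ) per object, in closed form
    kl = 0.5 * (torch.exp(log_var) + m ** 2 - 1.0 - log_var).sum(dim=1)
    return (reconstruction_loss + kl).mean()
```

Note how the KL term blows up as the variance goes to zero (the minus log s^2 part goes to infinity), which is exactly why training does not collapse to the deterministic, noise-free autoencoder.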
During training we try to force all these distributions to lie close to the standard Gaussian. But for a new image which the network never saw, some suspicious behavior or something else, the convolutional neural network of the encoder never saw this kind of image, right? So it can output a distribution on T_i as far away from the standard Gaussian as it wants, because it wasn't trained to make it close to the Gaussian. So by looking at the distance between the variational distribution Q_i and the standard Gaussian, you can understand how anomalous this point is, and you can detect outliers.

Also note that it's kind of easy to generate new points, to hallucinate new data, in this kind of model. Because your model is defined this way, as an integral with respect to P of T, you can make a new point, a new image, in two steps. First, sample T_i from the prior, from the standard normal, and then pass this sample from the standard Gaussian through your decoder network to decode the latent code into an image, and you will get some new samples, for example a fake picture of a cat or something like that.
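Both uses, generation and outlier detection, fit in a few lines, again assuming the hypothetical encoder/decoder sketches above rather than any particular library API.

```python
import torch

# Generation: sample latent codes from the prior N(0, I) and decode them.
def generate(decoder, n_samples=16, latent_dim=50):
    t = torch.randn(n_samples, latent_dim)   # t ~ p(t) = N(0, I)
    return decoder(t)                        # decoded images

# Outlier score: how far q(t | x) = N(m, s^2) is from the standard Gaussian.
# Large values suggest the encoder never saw anything like this input.
def outlier_score(encoder, x):
    m, log_var = encoder(x)
    return 0.5 * (torch.exp(log_var) + m ** 2 - 1.0 - log_var).sum(dim=1)
```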