Let's start by discussing the problem of fitting a distribution p(x) to a data-set of points. Why? Well, we have already discussed this problem in week one, when we fitted a Gaussian to a data-set of points, and in week two, when we discussed the clustering problem and how to solve it by fitting a Gaussian mixture model to our data. We also discussed probabilistic PCA, which is a kind of infinite mixture of Gaussians. But now we want to return to this question, because it turns out that the methods we covered, like the Gaussian, the Gaussian mixture model, or probabilistic PCA, are not enough to capture complicated objects like natural images.

So, you may want to fit a probability distribution to your data-set of natural images, for example, to generate new data. If you try to do that with a Gaussian mixture model, it will work, but not as well as some more sophisticated models we will discuss this week. In this example, for instance, some fake celebrity faces were generated with a generative model, and you can do these kinds of things if you have a probability distribution over your training data, because then you can sample new images from this distribution. If you have such a model p(x), you can also build a kind of "Photoshop of the future" application, like here: with a few brush strokes you change a few pixels in your image, and the program tries to recolor everything else so that the picture stays realistic. So it will change the color of the hair, and so on.

One more reason to fit a distribution p(x) to complicated structured data like images is to detect anomalies. For example, you run a bank and you have a sequence of transactions. If you fit a probabilistic model to this sequence of transactions, then for a new transaction you can estimate how probable it is according to the model trained on the current data-set, and if this particular transaction is not very probable, you may flag it as suspicious and ask a human to check it. Similarly, if you have security camera footage, you can train the model on footage from a normal day, and then, if something suspicious happens, you can detect it by noticing that some images from your cameras have a low probability p(x) under the model. So you can detect anomalies and suspicious behavior.
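Here is a minimal sketch of this density-based anomaly-detection idea, assuming a Gaussian mixture as the density model over simple tabular transaction features (for images the lecture argues we need richer models). The features, the number of components and the threshold are all illustrative assumptions, not part of the lecture.

```python
# Sketch: fit p(x) to "normal" transactions, then flag new transactions
# whose log-density under the model is unusually low.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Fake training data: 1000 normal transactions described by two features
# (say, amount and hour of day) -- purely synthetic for illustration.
normal_transactions = rng.normal(loc=[50.0, 12.0], scale=[10.0, 3.0], size=(1000, 2))

model = GaussianMixture(n_components=5, random_state=0).fit(normal_transactions)

# Threshold choice is an assumption: flag anything below the 1st percentile
# of the training log-densities.
threshold = np.percentile(model.score_samples(normal_transactions), 1)

new_transactions = np.array([[52.0, 11.0],     # looks ordinary
                             [500.0, 90.0]])   # far from anything seen in training
log_p = model.score_samples(new_transactions)  # log p(x) for each new transaction
print(log_p < threshold)                       # expected: [False  True]
```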
You may also want to handle missing data. For example, you have some images with obscured parts and you want to make predictions; in this case, having p(x), the probability of your data, helps you greatly. And finally, people sometimes want to represent highly structured data with low-dimensional embeddings. This is not an inherent property of modeling data with p(x), but, as we will see in the models we will cover, it comes naturally: the model assigns a latent code to any object it sees, and we can then use this latent code to explore the space of our objects in a convenient way. For example, people sometimes build these kinds of latent codes for molecules and then try to discover new drugs by exploring the space of molecules in this latent space. Okay, so let's say we are convinced: we want to model p(x) for natural images or other types of structured data. How can we do it?

Probably the most natural approach is to use a convolutional neural network, because that is something that works really well for images. Let's say that our convolutional neural network looks at the image and returns the probability of this image; it is the simplest possible parametric model that returns a probability for any image. To make things more stable, let's say the CNN actually returns the logarithm of the probability. The problem with this approach is that you have to normalize the distribution: you have to make it sum to one over all possible images, and there are far too many of them to enumerate. This normalization constant is very expensive to compute, and you have to compute it to do training or inference properly. So this approach is infeasible because of the normalization.

What else can you do? Well, you can use the chain rule. If you recall from week one, any probability distribution can be decomposed into a product of conditional distributions, and we can apply this to natural images, for example, like this. We have an image; in this case it is a three-by-three-pixel image, but of course in a practical situation you would use 100 by 100, or an even higher resolution. You can enumerate the pixels of this image somehow, for example in a row-by-row fashion, and then say that the distribution of the whole image is the joint distribution of its pixels. This joint distribution decomposes into a product of conditional distributions by the chain rule: the probability of the whole image equals the marginal probability of the first pixel, times the probability of the second pixel given the first one, and so on. Now you can try to build models of these conditional probabilities to model the overall joint probability, and if your model for the conditional probabilities is flexible enough, you do not lose anything, because any probability distribution can be represented this way.

A natural idea is to represent these conditional probabilities with a recurrent neural network, which reads your image pixel by pixel and outputs a prediction of the distribution over the next pixel's brightness (there is a small sketch of this below). This approach makes modeling much easier, because now the normalization constant only involves a one-dimensional distribution. If your image is grayscale, each pixel can be encoded with a number from 0 to 255, a brightness level, and the normalization constant can be computed by simply summing over these 256 values, so it is easy. It is a really nice approach; check it out if you have time. The downside is that you have to generate new images one pixel at a time: to generate a new image you first sample x1 from its marginal distribution, then feed the just-generated x1 into the RNN, which outputs the distribution over the next pixel, and so on. Because generation is sequential, no matter how many computers you have, one high-resolution image can take minutes, which is really long. So we may want to look at something else.
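For reference, here are the normalization problem and the chain-rule factorization written out; the symbol f_theta(x) for the CNN's output is my notation, not the lecture's.

```latex
% CNN as an unnormalized model: the network outputs f_\theta(x) as a log-score,
% but the normalizing constant Z sums over every possible image, which is intractable.
p(x) = \frac{\exp\!\left(f_\theta(x)\right)}{Z}, \qquad
Z = \sum_{x'} \exp\!\left(f_\theta(x')\right)

% Chain rule for a 3x3 image with pixels x_1, \dots, x_9 enumerated row by row:
p(x_1, \dots, x_9) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_9 \mid x_1, \dots, x_8)

% Each conditional is a distribution over just 256 brightness values,
% so its normalization is a cheap sum: for every pixel i,
\sum_{v=0}^{255} p(x_i = v \mid x_1, \dots, x_{i-1}) = 1
```

And here is a minimal sketch of the autoregressive-RNN idea in PyTorch; it is not the exact architecture from the lecture (or from the PixelRNN paper), and the class name, hidden size and start-token trick are illustrative assumptions.

```python
# Sketch of an autoregressive pixel model: an RNN reads the already-seen pixels
# and outputs a categorical distribution over the 256 intensities of the next pixel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelRNNSketch(nn.Module):
    def __init__(self, hidden=128, levels=256):
        super().__init__()
        self.embed = nn.Embedding(levels + 1, hidden)  # +1 for a "start" token
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, levels)           # logits over the 256 levels

    def forward(self, pixels):
        # pixels: (batch, n_pixels) integer intensities in [0, 255]
        start = torch.full((pixels.size(0), 1), 256,
                           dtype=torch.long, device=pixels.device)
        inp = torch.cat([start, pixels[:, :-1]], dim=1)   # shift right by one
        h, _ = self.rnn(self.embed(inp))
        return self.out(h)                                # p(x_i | x_1, ..., x_{i-1})

    @torch.no_grad()
    def sample(self, n_pixels=9):
        # Generation is inherently sequential: one pixel at a time.
        x = torch.full((1, 1), 256, dtype=torch.long)      # start token
        h, samples = None, []
        for _ in range(n_pixels):
            out, h = self.rnn(self.embed(x), h)
            probs = F.softmax(self.out(out[:, -1]), dim=-1)
            x = torch.multinomial(probs, 1)
            samples.append(x.item())
        return samples

model = PixelRNNSketch()
imgs = torch.randint(0, 256, (4, 9))                       # fake batch of flattened 3x3 images
logits = model(imgs)
loss = F.cross_entropy(logits.reshape(-1, 256), imgs.reshape(-1))
print(loss.item(), model.sample())
```

Training simply maximizes the likelihood of each pixel given the previous ones, while sampling has to run the loop in `sample` one step at a time, which is exactly why generating high-resolution images this way is slow.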
One more thing you can do is to say that the distribution over pixels is factorized, so each pixel is independent of the others. In this case, you can easily fit this kind of distribution to your data, but it turns out to be too restrictive an assumption. Even in the simple example of a data-set of handwritten digits, if you have around 10,000 of these small images and train this kind of factorized model on them, you get samples that do not look nice, like these. That is because the assumption that each pixel is independent of the others really does not hold for real data: for example, if you see one half of an image, you can probably restore the other half quite accurately, which means the two halves are not independent. So this assumption is too restrictive.

Another thing you can do is use a Gaussian mixture model. In theory it is really flexible and can represent any probability distribution, but in practice, for complicated data like natural images, it can be really inefficient: you would need maybe thousands of Gaussian components, and then the overall method fails to capture the structure because it becomes too hard to train.

One more thing we can try is an infinite mixture of Gaussians, like the probabilistic PCA method we covered in week two. Here the idea is that each object, each image x, has a corresponding latent variable t, and the image x is generated from this t, so we can marginalize t out. The conditional distribution of x given t is a Gaussian, so we effectively have a mixture of infinitely many Gaussians: for each value of t there is one Gaussian, and we mix them with weights. Note that even if the Gaussians are factorized, so that they have independent components in each dimension, the mixture is not. So this is a somewhat more powerful model than the Gaussian mixture model, and we will discuss in the next videos how to make it even more powerful by using neural networks inside this model.
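To recap this part, here are the three candidate model families written out explicitly; writing the Gaussian's parameters as functions mu(t) and Sigma(t) of the latent variable t is my notation, anticipating the neural-network parameterization discussed in the next videos.

```latex
% Fully factorized model: pixels are independent (too restrictive in practice)
p(x) = \prod_{k} p(x_k)

% Finite mixture of C Gaussians with mixing weights \pi_c
p(x) = \sum_{c=1}^{C} \pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c)

% Continuous ("infinite") mixture: a latent variable t is marginalized out,
% and each conditional p(x \mid t) is a (possibly factorized) Gaussian
p(x) = \int p(x \mid t)\, p(t)\, dt, \qquad
p(x \mid t) = \mathcal{N}\!\left(x \mid \mu(t), \Sigma(t)\right)
```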