[MUSIC] For once in this entire course, I actually get to tell you about some models that I really like. Let's talk about generative models. Now, there are a lot of generic, unsupervised models, like auto-encoders, and they try to solve all possible problems equally badly. Generative models, generative adversarial networks for example, try to generate images specifically, and since they specialize, they get to do this one thing really well.

Now, the problem of generating stuff is a broad one: you can generate images, music, voice or any other sound recording, abstract measurements of data from your favorite machine learning competition, or even complicated events from the Large Hadron Collider. But we'll study generative models on the example of image generation. The reason is kind of simple. When there is something wrong with a generative model, it's usually hard to tell, and there's no obvious way to judge whether one model is better or worse than another. Images are the exception: it's pretty easy to tell that a face generated by a model is wrong if it has, say, three eyes. So this gives you an intuitive advantage when trying to understand those models.

So, let's find out how to generate stuff. The pipeline here is kind of the reverse of what you usually have when classifying. In a classification problem, you have an image, and then you reduce it down with convolution, pooling, convolution, pooling, convolution, pooling, you know the drill. Then you use some dense layers, or any other deep learning architecture you like, to predict the final label, and your output is not an image but basically a small vector of numbers. Today's pipeline is going to be the reverse, but don't get scared by this just yet. The previous slide showed you how to classify a chair into a particular type and orientation. Now we're going to take that type and orientation, or maybe some random noise, or some other kind of description of a chair, and try to convert this description back into the original image, into pixels.

So this is how you can solve this problem: you take a description of an object, try to predict an image, and you train your model on the pixelwise mean squared error between the predicted pixels and the actual reference picture of the chair or other object. Now, this kind of works, but the problem is that this generative task cannot be solved unambiguously. You can probably draw more than one chair that satisfies all those properties, and for more complicated problems it's even worse. If you are asked to generate, say, a white, male, middle-aged face, there is more than one face that satisfies this description. I'm sorry, I might be slightly biased when it comes to describing the face here.

So the problem is that if there is more than one way you can predict something, if there is more than one correct answer, then trying to minimize the squared error is going to suck. Hard. If there are, say, two possible hair colors, blonde or dark, then minimizing MSE makes you predict the average, kind of unrealistic hair. If the face may or may not include facial hair, the model draws this kind of semi-transparent facial hair, and you get this average image that doesn't look like anything real.
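To make this concrete, here is a minimal sketch of that naive pipeline, assuming PyTorch (the lecture doesn't prescribe a framework): a small transposed-convolution decoder that turns a description vector into pixels and is trained with the pixelwise MSE we just discussed. All names, sizes, and the fake data are purely illustrative.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a low-dimensional description (chair type, orientation, noise, ...) to pixels."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.fc = nn.Linear(code_dim, 128 * 8 * 8)  # expand the code into a small feature map
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32x32 -> 64x64
            nn.Sigmoid(),                                          # pixels in [0, 1]
        )

    def forward(self, code):
        x = self.fc(code).view(-1, 128, 8, 8)
        return self.deconv(x)

decoder = Decoder()
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
criterion = nn.MSELoss()                    # the pixelwise loss in question

description = torch.randn(16, 64)           # fake "descriptions" just for illustration
reference = torch.rand(16, 3, 64, 64)       # fake reference images

predicted = decoder(description)
loss = criterion(predicted, reference)      # pixelwise MSE against one reference per input
loss.backward()
optimizer.step()
```

The important part is the criterion: when several different images are equally valid for the same description, minimizing this loss pushes the decoder toward their average.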
So the problem here is that pixelwise mean squared error is bad, like, real bad, and we want to avoid it if we want to generate effectively. Otherwise, we'll only be able to get those average images, like the ones you obtain from auto-encoders.

Now, this kind of objective is a bit tricky to define formally, but basically we want to find some efficient representation that has a few properties. It has to cover higher-order, semantic features: for faces, that might be the presence of facial hair, the orientation of the face, maybe the color of the eyes, but it doesn't have to encode every pixel. And the distance between two representations, the one we're going to minimize, shouldn't be about the pixelwise position of everything; instead, it should be a kind of semantic distance. So if two persons both have facial hair, the same skin tone, and more or less the same face position, then the error should be small, even though the images might be slightly shifted, for example.

The thing is, we already have this kind of representation. We were actually obtaining such representations with other kinds of networks earlier, and a representation like this automatically emerges when you solve a classification problem. Now, what kind of representation is this? Of course, there is more than one correct way to do this, but the popular approach we're going to build on now is to use the feature space that gets trained when you backpropagate, when you train a network on image classification. If you train a classifier on the ImageNet classes, and there's all kinds of stuff there, the features your network learns are more or less parts of those objects, maybe kinds of textures, but they contain all the semantic, high-level information, especially if you go deep into the network, near the output layers.

What they do not contain is orientation or position. Okay, orientation might be slightly more important here. The trick with position is that it doesn't actually matter where in the image the cat is: if you try to classify it, it's still a cat. Therefore, it's convenient for the network to learn features that don't change much if your cat's position changes slightly, and this is the exact kind of representation we need.

So what you can do is take some intermediate layer, deep enough in your network, use the activations of this layer, and take, say, the squared error of those activations as your target metric. You have your original image, and instead of minimizing the pixelwise error between the original reference image and the predicted image, you compute those convolutional features for both images and take the mean squared error between those high-level features. This is a very powerful approach: you use your previously trained classifier as a kind of specially trained metric to train a different model. And basically, all it takes to employ such an approach is a pre-trained classifier for a reasonably large classification problem adjacent to your generation problem. So if you are, say, classifying faces, it makes sense to then generate faces, and vice versa.
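Here is a minimal sketch of that feature-space MSE, often called a perceptual loss, assuming PyTorch and torchvision are available; VGG16 and the particular cut-off layer are just one reasonable choice on my part, not something the lecture prescribes.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained ImageNet classifier, used only as a fixed feature extractor.
# Cutting VGG16's convolutional part at module 16 is an arbitrary but common
# depth; deeper layers give more "semantic" features.
vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False  # the classifier itself is never updated

def perceptual_loss(generated, reference):
    # Compare deep activations instead of raw pixels, so a small shift or
    # pose change costs far less than a semantic mismatch does.
    # (In practice you would also normalize inputs with ImageNet statistics.)
    return nn.functional.mse_loss(vgg_features(generated), vgg_features(reference))

generated = torch.rand(4, 3, 224, 224, requires_grad=True)  # stand-in for a generator's output
reference = torch.rand(4, 3, 224, 224)                      # reference images
loss = perceptual_loss(generated, reference)
loss.backward()  # gradients flow back into whatever produced `generated`
```

The design point is exactly what the lecture describes: the classifier acts as a frozen, specially trained metric, and only the generator that produced the images receives gradient updates.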
Now, for image classification this kind of resource is freely available thanks to ImageNet, but for many other problems it's much more scarce. So let's now try to find out how we can get this intermediate network trained specifically for our own problem, instead of just borrowing it from another, similar problem. [MUSIC]