[MUSIC] For once in this entire course, I actually get to tell you about some models that I really like. Let's talk about generative models. Now, there are a lot of generic, unsupervised models, like auto-encoders, and they try to solve all possible problems equally badly. Generative models, generative adversarial networks for example, try to generate images specifically, and since they specialize, they get to do this one thing really well.

Now, the problem of generating stuff is a broad one: you can generate images, music, voice or any other sound recording, abstract measurements of data from your favorite machine learning competition, or even complicated events from the Large Hadron Collider. But we'll study generative models on the example of image generation. The reason is kind of simple. When there is something wrong with a generative model, it's usually hard to tell, and there's no obvious way to judge whether one model is better or worse than another. Images are the exception: it's pretty easy to tell that a face generated by a model is wrong if it has, say, three eyes. So this gives you an intuitive advantage when trying to understand those models.

So, let's find out how to generate stuff. The pipeline here is kind of the reverse of what you usually have when classifying. In a classification problem, you have an image, and then you reduce it down with convolution, pooling, convolution, pooling, convolution, pooling, you know the drill. Then you use some dense layers, or any other deep learning architecture you like, to predict the final label, and your output is not an image but basically a small vector of numbers. Today's pipeline is going to be the reverse, but don't get scared by this just yet. The previous slide showed you how to classify a chair into a particular type and orientation. Now we're going to take that type and orientation, or maybe some random noise, or some other kind of description of a chair, and try to convert this description back into the original image, into pixels.

So this is how you can solve this problem: you take a description of an object, try to predict an image, and you train your model on the pixelwise mean squared error between the predicted pixels and the actual reference picture of the chair or other object. Now, this kind of works, but the problem is that this generative task cannot be solved unambiguously. You can probably draw more than one chair that satisfies all those properties, and for more complicated problems it's even worse. If you are asked to generate, say, a white, male, middle-aged face, there is more than one face that satisfies this description. I'm sorry, I might be slightly biased when it comes to describing the face here.

So the problem is that if there is more than one way you can predict something, if there is more than one correct answer, then trying to minimize the squared error is going to suck. Hard. If there are, say, two possible hair colors, blonde or dark, then minimizing MSE makes you predict the average, kind of unrealistic hair. If the face may or may not include facial hair, the model draws this kind of semi-transparent facial hair, and you get this average image that doesn't look like anything real.
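To make this concrete, here is a minimal sketch of that naive pipeline, assuming PyTorch (the lecture doesn't prescribe a framework): a small transposed-convolution decoder that turns a description vector into pixels and is trained with the pixelwise MSE we just discussed. All names, sizes, and the fake data are purely illustrative.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a low-dimensional description (chair type, orientation, noise, ...) to pixels."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.fc = nn.Linear(code_dim, 128 * 8 * 8)  # expand the code into a small feature map
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32x32 -> 64x64
            nn.Sigmoid(),                                          # pixels in [0, 1]
        )

    def forward(self, code):
        x = self.fc(code).view(-1, 128, 8, 8)
        return self.deconv(x)

decoder = Decoder()
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
criterion = nn.MSELoss()                    # the pixelwise loss in question

description = torch.randn(16, 64)           # fake "descriptions" just for illustration
reference = torch.rand(16, 3, 64, 64)       # fake reference images

predicted = decoder(description)
loss = criterion(predicted, reference)      # pixelwise MSE against one reference per input
loss.backward()
optimizer.step()
```

The important part is the criterion: when several different images are equally valid for the same description, minimizing this loss pushes the decoder toward their average.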
So the problem here is that pixelwise mean squared error is bad, like, real bad, and we want to avoid it if we want to generate effectively. Otherwise, we'll only be able to get those average images, like the ones you obtain from auto-encoders.

Now, this kind of objective is a bit tricky to define formally, but basically we want to find some efficient representation that has a few properties. It has to cover higher-order, semantic features: for faces, that might be the presence of facial hair, the orientation of the face, maybe the color of the eyes, but it doesn't have to encode every pixel. And the distance between two representations, the one we're going to minimize, shouldn't be about the pixelwise position of everything; instead, it should be a kind of semantic distance. So if two persons both have facial hair, the same skin tone, and more or less the same face position, then the error should be small, even though the images might be slightly shifted, for example.

The thing is, we already have this kind of representation. We were actually obtaining such representations with other kinds of networks earlier, and a representation like this automatically emerges when you solve a classification problem. Now, what kind of representation is this? Of course, there is more than one correct way to do this, but the popular approach we're going to build on now is to use the feature space that gets trained when you backpropagate, when you train a network on image classification. If you train a classifier on the ImageNet classes, and there's all kinds of stuff there, the features your network learns are more or less parts of those objects, maybe kinds of textures, but they contain all the semantic, high-level information, especially if you go deep into the network, near the output layers.

What they do not contain is orientation or position. Okay, orientation might be slightly more important here. The trick with position is that it doesn't actually matter where in the image the cat is: if you try to classify it, it's still a cat. Therefore, it's convenient for the network to learn features that don't change much if your cat's position changes slightly, and this is the exact kind of representation we need.

So what you can do is take some intermediate layer, deep enough in your network, use the activations of this layer, and take, say, the squared error of those activations as your target metric. You have your original image, and instead of minimizing the pixelwise error between the original reference image and the predicted image, you compute those convolutional features for both images and take the mean squared error between those high-level features. This is a very powerful approach: you use your previously trained classifier as a kind of specially trained metric to train a different model. And basically, all it takes to employ such an approach is a pre-trained classifier for a reasonably large classification problem adjacent to your generation problem. So if you are, say, classifying faces, it makes sense to then generate faces, and vice versa.
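Here is a minimal sketch of that feature-space MSE, often called a perceptual loss, assuming PyTorch and torchvision are available; VGG16 and the particular cut-off layer are just one reasonable choice on my part, not something the lecture prescribes.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained ImageNet classifier, used only as a fixed feature extractor.
# Cutting VGG16's convolutional part at module 16 is an arbitrary but common
# depth; deeper layers give more "semantic" features.
vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False  # the classifier itself is never updated

def perceptual_loss(generated, reference):
    # Compare deep activations instead of raw pixels, so a small shift or
    # pose change costs far less than a semantic mismatch does.
    # (In practice you would also normalize inputs with ImageNet statistics.)
    return nn.functional.mse_loss(vgg_features(generated), vgg_features(reference))

generated = torch.rand(4, 3, 224, 224, requires_grad=True)  # stand-in for a generator's output
reference = torch.rand(4, 3, 224, 224)                      # reference images
loss = perceptual_loss(generated, reference)
loss.backward()  # gradients flow back into whatever produced `generated`
```

The design point is exactly what the lecture describes: the classifier acts as a frozen, specially trained metric, and only the generator that produced the images receives gradient updates.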
Now, for image classification this kind of resource is freely available thanks to ImageNet, but for many other problems it's much more scarce. So let's now try to find out how we can get this intermediate network trained specifically for our own problem, instead of just borrowing it from another, similar problem. [MUSIC]