[MUSIC] For once in this entire course, I actually get to explain some pieces that I really like. Let's talk about generative models. Now, there is a whole bunch of generic, unsupervised generative models. Much like auto-encoders, they try to solve all possible problems equally badly. Generative models such as generative adversarial networks, on the other hand, try to generate images specifically, and since they specialize, they get to do this one thing really well.

Now, the problem of generating stuff is a broad one. You could generate images, music, voice, or any abstract sound recording. You could try to generate abstract measurements of data from your favorite machine learning competition, or even complicated events from the Large Hadron Collider. But we'll study generative models through image generation, and the reason is kind of simple. When there is something wrong with a generative model, it's usually hard to tell, and there's no obvious way to judge whether one model is better or worse than another. Images are an exception: it's usually pretty easy to tell that a face generated by a model is wrong if it has, say, three eyes. So this gives you an intuitive advantage in understanding the models.

So, let's find out how to generate stuff. The pipeline here is roughly the reverse of what you usually have when classifying. In a classification problem, you have an image, and you reduce it down with convolution, pooling, convolution, pooling, you know the drill. Then you use some dense layers, or any other deep learning architecture you like, to predict the final label, so your output is not an image but basically a vector of numbers, usually a small one. Today's pipeline is going to be the reverse, but don't get scared by this just yet. The previous slide showed you how to classify a chair into a particular type and orientation. Now we're going to take that type and orientation, maybe some random noise, or maybe some other kind of description of a chair, and we'll try to convert this description back into the original image, into pixels.
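To make this reverse pipeline concrete, here is a minimal sketch in PyTorch (my choice of framework; the lecture doesn't prescribe this exact code, and all layer sizes are illustrative assumptions). A small description vector, standing in for the chair's type and orientation or just random noise, is upsampled back into pixels with transposed convolutions, and the model is fit with the pixel-wise mean squared error discussed next.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, code_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            # turn the description vector into a tiny spatial feature map
            nn.Unflatten(1, (code_dim, 1, 1)),
            nn.ConvTranspose2d(code_dim, 128, kernel_size=4),                 # 1x1 -> 4x4
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 4x4 -> 8x8
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),   # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),    # 16x16 -> 32x32
            nn.Sigmoid(),                                                     # pixels in [0, 1]
        )

    def forward(self, code):
        return self.net(code)

gen = Generator()
codes = torch.randn(8, 64)                        # 8 "descriptions" (here: just noise)
images = gen(codes)                               # -> (8, 3, 32, 32)
reference = torch.rand(8, 3, 32, 32)              # stand-in for real reference pictures
loss = nn.functional.mse_loss(images, reference)  # the pixel-wise MSE objective
loss.backward()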
So this is how you can solve this problem. You have a description of an object, you try to predict an image, and you train your model via a pixel-wise mean squared error between the predicted pixels and the actual reference picture of a chair or some other object. Now, this kind of works, but the problem is that this generative task cannot be solved exactly: you can probably draw more than one chair that satisfies all those properties. And for more complicated problems it's even worse. If you are tasked with generating, say, a white, male, middle-aged face, there's more than one face that satisfies this description. I'm sorry, I might be slightly biased when it comes to describing the face here. So the problem is that if there is more than one way you can predict something, if there is more than one correct answer, then minimizing the squared error is going to suck. Hard. If there are, say, two possible hair colors, blonde or dark, then minimizing MSE makes you predict the average, a kind of unrealistic hair. If faces may or may not have facial hair, the prediction will have a seemingly transparent facial hair. You'll get an average image that doesn't look like anything real. So pixel-wise mean squared error is bad, like, real bad, and we want to avoid it if we want to generate effectively. Otherwise, we'll only be able to get those average images, like the ones you obtain from auto-encoders.

What we actually want to find is some efficient representation that only has a few properties. It has to capture the higher-order, semantic features: for faces, that might be the presence of facial hair, the orientation of the face, or maybe eye color, but it doesn't have to encode every pixel. And the distance between two such representations, the distance we're going to minimize, shouldn't depend on the pixel-wise position of everything. Instead, it should be a kind of semantic distance.
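Before developing that idea, here is a toy sketch of the averaging problem just described, with made-up numbers: half of the reference pixels are "blonde" (around 0.9), half are "dark" (around 0.1), and the single MSE-optimal prediction settles on a muddy average that matches neither.

import torch

blonde = torch.full((500,), 0.9)     # hypothetical blonde-hair pixel intensity
dark = torch.full((500,), 0.1)       # hypothetical dark-hair pixel intensity
targets = torch.cat([blonde, dark])  # two equally valid "correct answers"

pred = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([pred], lr=0.5)
for _ in range(100):
    opt.zero_grad()
    loss = ((pred - targets) ** 2).mean()  # pixel-wise MSE
    loss.backward()
    opt.step()

print(pred.item())  # ~0.5: the "seemingly transparent" average, not a real hair color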
So what does this semantic distance mean? If two people both have facial hair, the same skin tone, and more or less the same face position, then the error should be small, even if the images are slightly shifted, for example. The good news is that we already have this kind of representation. We were actually obtaining such representations with other kinds of networks previously: they emerge automatically when we solve a classification problem. So what kind of representation is this? Of course, there is more than one correct way to get one, but the popular approach we're going to build on now is to use the feature space that gets learned when you backpropagate, when you train a network on image classification. If you train a classifier on the ImageNet classes, and there's all kinds of stuff in there, the features you end up learning are more or less parts of those objects, or maybe kinds of textures, but they contain all the semantic, high-level information, especially if you go deep into the network, near the output layers. What they do not contain is orientation. Or position. Okay, orientation might be slightly more important here. The trick with position is that it doesn't actually matter where on the image the cat is: if you try to classify it, it's still a cat. Therefore, it's convenient for the network to learn features that don't change much if your cat's position changes slightly, and this is exactly the kind of representation we need. So what you can do is take some intermediate layer, deep enough in your network, use the activations of this layer, and take, say, the squared error between those activations as your target metric. Instead of minimizing the pixel-wise error between your original reference image and your predicted image, you compute those high-level convolutional features for both images and compute the mean squared error between them. This is a very powerful approach: you use a previously trained classifier as a kind of specially trained metric to train a different model.
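Here is a minimal sketch of this feature-space loss, assuming a torchvision VGG16 pretrained on ImageNet as the borrowed classifier (the lecture doesn't name a specific network, and the layer cutoff is an arbitrary choice). The classifier is frozen and only used to embed both images before comparing them.

import torch
import torch.nn.functional as F
from torchvision import models

# take the convolutional part of a pretrained VGG16 up to an intermediate layer
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # the classifier is a fixed, specially trained metric

def perceptual_loss(generated, reference):
    # MSE between deep activations rather than raw pixels: images that agree
    # on semantics (facial hair, pose, skin tone) score close even if shifted
    return F.mse_loss(vgg(generated), vgg(reference))

generated = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in generator output
reference = torch.rand(1, 3, 224, 224)                      # stand-in target image
loss = perceptual_loss(generated, reference)
loss.backward()  # gradients flow back toward the generator
# (a real setup would also normalize inputs with ImageNet statistics)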
And basically, all it takes to employ such an approach is a pre-trained classifier for a reasonably large classification problem adjacent to your generation problem. So if you have, say, a classifier of faces, it makes sense to use it when generating faces, and vice versa. For image generation, this kind of resource is freely available thanks to ImageNet, but in many other domains it's much more scarce. So let's now try to find out how we can train this intermediate network specifically for our problem, instead of just borrowing it from another, similar problem. [MUSIC]