For example, you could try to classify images for your usual MNIST problem, except that this time you are not that interested in classifying the digits in MNIST: you want to classify digits on house numbers instead. The problem with house numbers is that, say, you have a labeled MNIST dataset, but no one has given you labeled house numbers yet; let's just pretend those labels don't exist.

Of course this is a toy problem, but the same thing arises when, for example, you have an image classifier and you want to apply it to classify images from your social network. So you move to a slightly different set of photo cameras, a different set of brands, maybe just different content. And you want your network to be as good on this changed domain, on this new dataset, as it was on the originally labeled one.

Now, of course, you could just stop training earlier and somehow validate, but let's see how we can improve over this classical approach with an adversarial component.

Your original task is handled by a classifier or regressor like this one. Let's split this model into two parts: the first part, the left one, tries to extract features, and the second part uses them to predict something. This division is, however, rather arbitrary; there is no specific boundary to split the model at, you can pick any one. The whole model is usually trained via backpropagation. Now, if you want to prevent this model from overfitting to your particular domain, let's try to apply the adversarial idea to those features.

Here there is this purple network, which is our discriminator; it looks at the intermediate features. It tries to judge how the model sees the world: it tries to distinguish between those features as your model processes the initial training set of images and as it processes the target domain of images. So it basically tries to see whether there is any difference between how your model behaves on training objects and on those out-of-domain, target-domain images, the social network photos, for example.
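To make this concrete, here is a minimal PyTorch sketch of such a split, with a discriminator attached to the intermediate features. All layer sizes here are illustrative assumptions, not the exact architecture from the slides.

```python
import torch
import torch.nn as nn

# "Left part": extracts features from an image.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch, 64)
)

# "Right part": predicts class labels from those features.
label_predictor = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 10),   # e.g. 10 digit classes
)

# The "purple" discriminator: looks at the same intermediate features and
# tries to tell source-domain images from target-domain images.
domain_discriminator = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 1),    # logit for "these features came from the source domain"
)

x = torch.randn(8, 3, 32, 32)             # a dummy batch of images
features = feature_extractor(x)
class_logits = label_predictor(features)
domain_logits = domain_discriminator(features)
```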
Now the question is: let's say your purple network succeeds, and it reaches almost 100% accuracy in telling whether your network is currently processing a training image or a target-domain image. Is that good or bad?

Well, yeah, exactly: if the discriminator can tell, simply by looking at the features, what kind of image it is, it means the representation your neural network has learned is different for training and for target-domain images. If something differs between training and validation data, that is usually a bad sign; in this case it is a really bad sign, it means your model overfits to the source domain.

So, aside from the original loss, this L_classifier here, you also add an adversarial component: a term based on the probability that the features come from the real training set. This basically means that you want to train those features, this left part of your classifier network, so that it is indistinguishable how the model operates on the training data and how it operates on the target domain.

And again, you train those two models simultaneously: you optimize this kind of mixed objective for the classifier. Of course, you can tune it slightly by scaling the adversarial term in the classifier objective by a multiplicative constant, a kind of regularization factor, if you wish.
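Continuing the sketch above, one hedged way to implement this mixed objective is to alternate a discriminator step with a feature-extractor/classifier step. The original recipe is often implemented with a gradient reversal layer instead; the "fool the discriminator" term below is the equivalent adversarial formulation, and `lam` is an assumed value for the scaling constant.

```python
import torch.nn.functional as F

lam = 0.1   # assumed value for the multiplicative "regularization" constant
opt_model = torch.optim.Adam(
    list(feature_extractor.parameters()) + list(label_predictor.parameters()))
opt_disc = torch.optim.Adam(domain_discriminator.parameters())

def training_step(x_src, y_src, x_tgt):
    # 1) Discriminator step: learn to label source features 1, target features 0.
    f_src = feature_extractor(x_src).detach()
    f_tgt = feature_extractor(x_tgt).detach()
    d_loss = (
        F.binary_cross_entropy_with_logits(
            domain_discriminator(f_src), torch.ones(len(x_src), 1))
        + F.binary_cross_entropy_with_logits(
            domain_discriminator(f_tgt), torch.zeros(len(x_tgt), 1)))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # 2) Feature + classifier step: the usual classification loss on labeled
    #    source data, plus an adversarial term that pushes target features to
    #    look like source ones, i.e. makes the two domains indistinguishable.
    cls_loss = F.cross_entropy(label_predictor(feature_extractor(x_src)), y_src)
    adv_loss = F.binary_cross_entropy_with_logits(
        domain_discriminator(feature_extractor(x_tgt)),
        torch.ones(len(x_tgt), 1))
    opt_model.zero_grad()
    (cls_loss + lam * adv_loss).backward()
    opt_model.step()

# Usage with dummy batches: labeled source images, unlabeled target images.
x_src, y_src = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_tgt = torch.randn(8, 3, 32, 32)
training_step(x_src, y_src, x_tgt)
```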
And this way, you can obtain a model that adapts toward a domain for which you do not even need any labels. It does not need labeled social network images, or labeled house numbers; what it needs is a labeled source dataset (labeled ImageNet or MNIST in our examples) plus unlabeled data from your target domain. So, basically, this is a very powerful idea.

And since I am promoting the idea that deep learning is a kind of language you speak to your machine learning model to describe what you actually want it to learn, this adversarial approach gives you a word of power, which is "indistinguishable". If you want some kind of behavior to be indistinguishable between one case and another, you can train a discriminator and optimize against it in an adversarial manner.

So this was the method of adversarial domain adaptation, but just to unwind, let's cover some of the fancier applications here.

You have probably all seen the cool artificial intelligence apps like Prisma or Artisto; Prisma is probably the overwhelming favorite here. The idea is that those apps morph your image in a way that follows the artistic style of a particular painting, or maybe a particular style of art, like impressionism, for example. And so far, you do this magically: you insert your image into a super-mega image box and wait for a minute. Let's cover the math, the nuts and bolts of how you actually do that.

Again, you have to make some representation of the model indistinguishable here. You will not need a special trainable discriminator this time, but you do want to somehow obtain an image representation that preserves only the style information. So you want to define the style of an image in a way that the representation you get covers style but not content. Basically, you have to mimic the style, but you have to preserve the content of the image: if you have a selfie, you still want to see your face on it, but the style, the texture, should be like Monet, or something similar.

This is, again, not quite a mathematical problem; it is a heuristical one, if you wish. You could try to define this art style by taking a pre-trained network, one of the ImageNet models, for example, and taking some kind of representation from this network that only captures local information. Of course, you will not be able to just take it as is: you will have to compute something that preserves the low-level texture information but throws away all the higher-order features, like what is actually on the image. Can you think of such a transformation?

There is, of course, more than one way to do that, and it is likely that at least some of you managed to land on something greater than the idea we are going to cover right now. But what you could do, at least, is take filters from some lower layers of the pre-trained network: layers that are not too deep, shallow enough that the filters only catch texture and very small image details. Then you can either average over the whole image, like global average pooling, or compute the Gram matrix over this two-dimensional activation map.

The intuition behind Gram matrices, if you do not know the math, can be explained the following way: you compute how frequently texture features coexist, coincide, at adjacent locations. You compute this for all pairs of features, and you use the resulting matrix as a style descriptor, as a representation of an art style.
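Here is a minimal sketch of that descriptor in code; the layer choice and the shapes are assumptions for illustration.

```python
import torch

def gram_matrix(activations: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (batch, channels, height, width) activation map.

    Entry (i, j) measures how strongly filters i and j fire together
    across spatial locations: texture statistics with the "what is
    where" information averaged out.
    """
    b, c, h, w = activations.shape
    feats = activations.reshape(b, c, h * w)      # flatten the spatial dims
    gram = feats @ feats.transpose(1, 2)          # (b, c, c) co-activation matrix
    return gram / (c * h * w)                     # normalize by map size

# Usage: activations would come from a shallow conv layer of a pretrained
# network; a random tensor stands in for them here.
acts = torch.randn(1, 64, 128, 128)
style_descriptor = gram_matrix(acts)              # shape (1, 64, 64)
```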
Now you could compute this style descriptor for your reference image, say The Starry Night or some Monet painting, and then compute the same descriptor for your selfie. Those descriptors are going to be pretty different, because your selfie obviously is not a painting; it was not painted with a brush. But the idea is that when those two representations, those two descriptors, differ, you can compute some difference between them, say a squared error.

And this whole procedure is going to be differentiable, which is the really important part here, because, remember, we take filters from a differentiable neural network. Then we compute the Gram matrix, or we just average over the whole field, which is simpler but yields less impressive results. Either way, we compute the descriptors and then the difference between the two Gram matrices, and this is, in fact, just a set of multiplications, additions, and maybe some nonlinearities along the way.

Now, if you then adjust your image, if you take your selfie and tweak it to make its style descriptor, this Gram matrix, similar to the one of your reference image, say The Starry Night, then your selfie will slowly take on features of this painting, but not its content.

Since we are only optimizing the texture so far, the result is going to be quite inferior, because the image may even lose its own content as it tries to optimize textures. So let's also add a content term: we want an image which looks like The Starry Night, or any other painting you like, in terms of texture, but which also looks like your selfie in terms of content. Now, where do you get content, and how do you separate it from everything else? If you want higher-level features, you can just go deeper: take maybe a pre-final dense layer, or some of the top convolutional layers, depending on the architecture.

And again, you can just compute a difference there too; it is going to be perfectly differentiable. Then you weigh the terms by multiplying each of those differences by some coefficient, and you minimize the sum over the pixels of the image. You are going to start with a random image or with your selfie (a random image is arguably slightly better), and then you just morph it by following the gradient direction, or with any other optimization method, following the gradient of this texture dissimilarity plus content dissimilarity.
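Putting the pieces together, here is a hedged end-to-end sketch of this optimization using torchvision's pretrained VGG19 (assuming a recent torchvision and reusing the `gram_matrix` helper from the sketch above). The layer indices, step count, and loss weights are illustrative assumptions, not the exact recipe from the lecture.

```python
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)   # the network is fixed; only the image changes

STYLE_LAYERS = [0, 5, 10]     # shallow conv layers: texture statistics
CONTENT_LAYER = 21            # a deeper conv layer: content

def extract(img):
    """Collect Gram matrices at shallow layers and raw activations deeper in."""
    style_grams, content_act, h = [], None, img
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in STYLE_LAYERS:
            style_grams.append(gram_matrix(h))
        if i == CONTENT_LAYER:
            content_act = h
            break
    return style_grams, content_act

# Random tensors stand in for the painting and the selfie; a real run would
# load images and normalize them with the ImageNet mean and std.
style_img = torch.rand(1, 3, 256, 256)
content_img = torch.rand(1, 3, 256, 256)
target_style, _ = extract(style_img)
_, target_content = extract(content_img)

img = content_img.clone().requires_grad_(True)   # or start from random noise
opt = torch.optim.Adam([img], lr=0.05)
style_weight, content_weight = 1e4, 1.0          # assumed mixing coefficients

for step in range(200):
    style_grams, content_act = extract(img)
    style_loss = sum(F.mse_loss(g, t) for g, t in zip(style_grams, target_style))
    content_loss = F.mse_loss(content_act, target_content)
    loss = style_weight * style_loss + content_weight * content_loss
    opt.zero_grad()
    loss.backward()   # gradients flow all the way back to the pixels
    opt.step()
```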
This builds you an image which inherits the content from the selfie and the texture from the painting. Now, here is an example of how this thing actually works: this photo was morphed to resemble Van Gogh's style of painting. This is, of course, a slightly modified, somewhat hacky version of the algorithm, so it is not just the activations of one layer for textures and another layer for content. We will include a more complete description in the reading section: which layers to use, which networks, and what optimization methods to apply to get faster results. We encourage you to follow the URL there and try this yourself, if you have not done so already.

See you in the next section.