1
00:00:00,000 --> 00:00:03,500
The cost function of the neural style transfer algorithm had

2
00:00:03,500 --> 00:00:07,975
a content cost component and a style cost component.

3
00:00:07,975 --> 00:00:11,455
Let's start by defining the content cost component.

4
00:00:11,455 --> 00:00:17,595
Remember that this is the overall cost function of the neural style transfer algorithm.

5
00:00:17,595 --> 00:00:21,660
So, let's figure out what the content cost function should be.

6
00:00:21,660 --> 00:00:26,380
Let's say that you use hidden layer l to compute the content cost.

7
00:00:26,380 --> 00:00:30,920
If l is a very small number, if you use hidden layer one,

8
00:00:30,920 --> 00:00:34,440
then it will really force your generated image

9
00:00:34,440 --> 00:00:37,875
to have pixel values very similar to your content image.

10
00:00:37,875 --> 00:00:39,735
Whereas, if you use a very deep layer,

11
00:00:39,735 --> 00:00:41,260
then it's just asking, "Well,

12
00:00:41,260 --> 00:00:43,040
if there is a dog in your content image,

13
00:00:43,040 --> 00:00:46,150
then make sure there is a dog somewhere in your generated image."

14
00:00:46,150 --> 00:00:50,310
So in practice, layer l is chosen somewhere in between.

15
00:00:50,310 --> 00:00:53,015
It's neither too shallow nor too deep in the neural network.

16
00:00:53,015 --> 00:00:55,780
And because you'll play with this yourself

17
00:00:55,780 --> 00:00:58,765
in the programming exercise at the end of this week,

18
00:00:58,765 --> 00:01:01,260
I'll leave you to gain some intuition from

19
00:01:01,260 --> 00:01:04,475
the concrete examples in the programming exercise as well.

20
00:01:04,475 --> 00:01:06,810
But usually, l is chosen to be somewhere

21
00:01:06,810 --> 00:01:09,080
in the middle of the layers of the neural network,

22
00:01:09,080 --> 00:01:12,170
neither too shallow nor too deep.

23
00:01:12,170 --> 00:01:15,285
What you can do is then use a pre-trained ConvNet,

24
00:01:15,285 --> 00:01:17,317
maybe a VGG network,

25
00:01:17,317 --> 00:01:20,020
or it could be some other neural network as well.

26
00:01:20,020 --> 00:01:22,050
And now, you want to measure,

27
00:01:22,050 --> 00:01:26,160
given a content image and given a generated image,

28
00:01:26,160 --> 00:01:29,688
how similar they are in content.

29
00:01:29,688 --> 00:01:31,540
So, let's let

30
00:01:31,540 --> 00:01:39,900
a^[l](C) and a^[l](G) be the activations of layer l on these two images,

31
00:01:39,900 --> 00:01:42,814
on the images C and G. So,

32
00:01:42,814 --> 00:01:47,020
if these two activations are similar,

33
00:01:47,020 --> 00:01:52,602
then that would seem to imply that both images have similar content.

34
00:01:52,602 --> 00:01:54,855
So, what we'll do is define

35
00:01:54,855 --> 00:02:01,510
J_content(C,G) as just how

36
00:02:01,510 --> 00:02:05,345
similar or how different these two activations are.

37
00:02:05,345 --> 00:02:08,320
So, we'll take the element-wise difference between

38
00:02:08,320 --> 00:02:12,200
these hidden unit activations in layer l,

39
00:02:12,200 --> 00:02:14,710
between when you pass in the content image compared

40
00:02:14,710 --> 00:02:17,736
to when you pass in the generated image,

41
00:02:17,736 --> 00:02:19,955
and take that squared.

42
00:02:19,955 --> 00:02:23,760
And you could have a normalization constant in front or not,

43
00:02:23,760 --> 00:02:25,535
so it's just one half or something else.
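In symbols, the overall cost from the previous video and the content cost being defined here, with the optional one-half normalization constant just mentioned, are:

    J(G) = \alpha \, J_{content}(C, G) + \beta \, J_{style}(S, G)

    J_{content}(C, G) = \frac{1}{2} \left\| a^{[l](C)} - a^{[l](G)} \right\|^2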
44
00:02:25,535 --> 00:02:31,935
It doesn't really matter, since this can also be adjusted by the hyperparameter alpha.

45
00:02:31,935 --> 00:02:37,070
So, just to be clear,

46
00:02:37,070 --> 00:02:42,635
I'm using this notation as if both of these have been unrolled into vectors,

47
00:02:42,635 --> 00:02:47,975
so that this becomes the squared L2 norm between a^[l](C) and a^[l](G),

48
00:02:47,975 --> 00:02:51,680
after you've unrolled them both into vectors.

49
00:02:51,680 --> 00:03:11,850
It's really just the element-wise sum of squares of differences between the activations in layer l, between the images C and G. And so,

50
00:03:11,850 --> 00:03:17,100
when later you perform gradient descent on J(G) to try to find a value of G,

51
00:03:17,100 --> 00:03:19,740
so that the overall cost is low,

52
00:03:19,740 --> 00:03:23,120
this will incentivize the algorithm to find an image G,

53
00:03:23,120 --> 00:03:29,203
so that these hidden layer activations are similar to what you got for the content image.

54
00:03:29,203 --> 00:03:33,985
So, that's how you define the content cost function for neural style transfer.

55
00:03:33,985 --> 00:03:37,000
Next, let's move on to the style cost function.
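To make this concrete, here is a minimal sketch of the content cost in NumPy, assuming a_C and a_G hold the layer-l activations of the content and generated images from a pre-trained ConvNet such as VGG. The function name, argument names, and the one-half constant are illustrative choices, not part of the lecture:

import numpy as np

def content_cost(a_C, a_G):
    # a_C, a_G: layer-l activation volumes for the content and
    # generated images, e.g. arrays of shape (n_H, n_W, n_C).
    # Unroll both activation volumes into vectors.
    a_C_vec = a_C.reshape(-1)
    a_G_vec = a_G.reshape(-1)
    # Element-wise sum of squared differences: the squared L2 norm
    # between the unrolled activations. The 1/2 in front is the
    # optional normalization constant; its exact value can be
    # absorbed into the hyperparameter alpha.
    return 0.5 * np.sum((a_C_vec - a_G_vec) ** 2)

Gradient descent on J(G) then updates the pixels of G so that this quantity, weighted by alpha, decreases alongside the style cost.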