In the last video, you saw how to define the content cost function for neural style transfer. Next, let's take a look at the style cost function. So, what does the style of an image mean? Let's say you have an input image like this, and you use a convnet like this one to compute features at different layers. And let's say you've chosen some layer l, maybe that layer, to define the measure of the style of an image. What we're going to do is define the style as the correlation between activations across different channels in this layer-l activation.

So here's what I mean by that. Let's say you take that layer-l activation, which is going to be an $n_H \times n_W \times n_C$ block of activations, and we're going to ask how correlated the activations are across different channels. To explain what I mean by this maybe slightly cryptic phrase, let's take this block of activations and let me shade the different channels with different colors. In this example below, we have, say, five channels, which is why I have five shades of color here. In practice, of course, a neural network usually has a lot more channels than five, but using just five makes the drawing easier.

To capture the style of an image, what you're going to do is the following. Let's look at the first two channels, say the red channel and the yellow channel, and ask how correlated the activations in these first two channels are. For example, in the lower right-hand corner, you have some activation in the first channel and some activation in the second channel, and that gives you a pair of numbers. What you do is look at different positions across this block of activations and at each one take that pair of numbers, one in the first channel, the red channel, and one in the second channel, the yellow channel. Looking across all of these $n_H \times n_W$ positions, you ask how correlated these two numbers are. So, why does this capture style? Let's look at another example, using one of the visualizations from the earlier video.
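Before turning to that visualization, here is a minimal numeric sketch of the channel pairing just described. The activation block is random stand-in data, and the shapes and channel indices are illustrative assumptions:

```python
import numpy as np

# Hypothetical layer-l activation block of shape (n_H, n_W, n_C),
# filled with random stand-in values for illustration.
n_H, n_W, n_C = 4, 4, 5
rng = np.random.default_rng(0)
a = rng.random((n_H, n_W, n_C))

# Take the first two channels (the "red" and "yellow" ones in the drawing)
# and flatten each into a vector of n_H * n_W activations, one per position.
red = a[:, :, 0].ravel()
yellow = a[:, :, 1].ravel()

# Pair the two numbers at every position. The style matrix defined later
# uses the unnormalized product sum; np.corrcoef is shown here only to
# make the word "correlated" concrete.
print(np.sum(red * yellow))            # unnormalized pairing, used later
print(np.corrcoef(red, yellow)[0, 1])  # normalized correlation, for intuition
```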
That visualization comes again from the paper by Matthew Zeiler and Rob Fergus that I referenced earlier. Let's say, for the sake of argument, that the red channel corresponds to this neuron, which is trying to figure out whether there's this little vertical texture at a particular position in the image, and let's say that this second channel, the yellow channel, corresponds to this neuron, which is vaguely looking for orange-colored patches.

What does it mean for these two channels to be highly correlated? Well, if they're highly correlated, that means whatever part of the image has this type of subtle vertical texture will probably also have this orange-ish tint. And what does it mean for them to be uncorrelated? It means that whenever there is this vertical texture, that part of the image probably won't have that orange-ish tint. So the correlation tells you which of these high-level texture components tend to occur, or not occur, together in part of an image. That degree of correlation gives you one way of measuring how often these different high-level features, such as the vertical texture or the orange tint or other things as well, occur together and don't occur together in different parts of an image.

So, if we use the degree of correlation between channels as a measure of style, then what you can do is measure the degree to which, in your generated image, this first channel is correlated or uncorrelated with the second channel. That tells you how often, in the generated image, this type of vertical texture occurs or doesn't occur with this orange-ish tint, and it gives you a measure of how similar the style of the generated image is to the style of the input style image.

So let's now formalize this intuition. What you're going to do is, given an image, compute something called a style matrix, which will measure all the correlations we talked about on the last slide. More formally, let $a^{[l]}_{i,j,k}$ denote the activation at position $(i, j, k)$ in hidden layer l.
So i indexes into the height, j indexes into the width, and k indexes across the different channels. In the previous slide we had five channels, so k would index across those five channels.

What you do with the style matrix is compute a matrix called $G^{[l]}$. This is going to be an $n_C \times n_C$ dimensional matrix, so it's a square matrix. Remember you have $n_C$ channels, and so you need an $n_C \times n_C$ dimensional matrix in order to measure how correlated each pair of them is. In particular, $G^{[l]}_{kk'}$ will measure how correlated the activations in channel k are with the activations in channel k'. Here, k and k' will range from 1 through $n_C$, the number of channels in that layer.

More formally, here is the way you compute $G^{[l]}$; I'm just going to write down the formula for computing one element, the $(k, k')$ element:

$$G^{[l]}_{kk'} = \sum_{i=1}^{n_H} \sum_{j=1}^{n_W} a^{[l]}_{i,j,k} \, a^{[l]}_{i,j,k'}$$

Remember, i and j index across the different positions in the block, over the height and width. So i is summed from 1 to $n_H$, j is summed from 1 to $n_W$, and k and k' index over the channels, ranging from 1 to the total number of channels in that layer of the neural network. All this is doing is summing over the different positions of the image, over the height and width, and multiplying together the activations of channels k and k'. That's the definition of $G^{[l]}_{kk'}$, and you do this for every value of k and k' to compute this matrix $G^{[l]}$, also called the style matrix.

Notice that if both of these activations tend to be large together, then $G^{[l]}_{kk'}$ will be large, whereas if they're uncorrelated then $G^{[l]}_{kk'}$ might be small. And technically, I've been using the term correlation to convey intuition, but this is actually the unnormalized cross-covariance, because we're not subtracting out the mean; we're just multiplying these elements directly. So this is how you compute the style of an image.
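In code, that whole double sum over positions collapses into one matrix product. Here is a minimal NumPy sketch, with a random block standing in for the layer-l activations (the shapes and data are illustrative assumptions):

```python
import numpy as np

# Random stand-in for the (n_H, n_W, n_C) activation block at layer l.
n_H, n_W, n_C = 4, 4, 5
rng = np.random.default_rng(0)
a = rng.random((n_H, n_W, n_C))

# Flatten the spatial dimensions so each row of `flat` is one channel's
# n_H * n_W activations: flat has shape (n_C, n_H * n_W).
flat = a.reshape(n_H * n_W, n_C).T

# Style (Gram) matrix: G[k, k'] = sum over positions of a[i,j,k] * a[i,j,k'].
G = flat @ flat.T  # shape (n_C, n_C)

# Sanity-check one entry against the explicit double-sum definition.
k, k_prime = 0, 1
assert np.isclose(G[k, k_prime], np.sum(a[:, :, k] * a[:, :, k_prime]))
```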
And you actually do this for both the style image S and the generated image G. To distinguish them, let me add a round bracket (S) there, writing $G^{[l](S)}$, just to denote that this is the style matrix for the style image S, computed from the activations on image S. Then you compute the same thing for the generated image. It's really the same formula, with the same summation indices:

$$G^{[l](G)}_{kk'} = \sum_{i=1}^{n_H} \sum_{j=1}^{n_W} a^{[l](G)}_{i,j,k} \, a^{[l](G)}_{i,j,k'}$$

and to denote that this one is for the generated image, I'll just put the round bracket (G) there.

So now you have two matrices that capture the style of the image S and the style of the image G. By the way, we've been using the capital letter G to denote these matrices; in linear algebra, these are also called Gram matrices, but in these videos I'm just going to use the term style matrix. Still, the term Gram matrix is why we use capital G to denote them.

Finally, the cost function, the style cost function: if you're doing this on layer l between S and G, you can define it as the difference between these two matrices. Since these are matrices, you take the sum of squares of the element-wise differences:

$$J^{[l]}_{\text{style}}(S, G) = \frac{1}{\left(2\, n_H^{[l]} n_W^{[l]} n_C^{[l]}\right)^2} \sum_{k} \sum_{k'} \left( G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'} \right)^2$$

The authors actually used this normalization constant, two times $n_H$, $n_W$, $n_C$ of that layer, all squared, and you can put it out in front like this. But the normalization constant doesn't matter that much, because this cost gets multiplied by some hyperparameter β anyway.
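Putting the two pieces together, here is a sketch of the per-layer style cost. The function names and the random inputs are assumptions for illustration; the scaling follows the normalization constant above:

```python
import numpy as np

def gram(a):
    """Style matrix: G[k, k'] = sum_{i,j} a[i, j, k] * a[i, j, k']."""
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)
    return flat.T @ flat

def layer_style_cost(a_S, a_G):
    """Sum of squared element-wise differences of the two style matrices,
    scaled by 1 / (2 * n_H * n_W * n_C)**2 as in the formula above."""
    n_H, n_W, n_C = a_S.shape
    scale = 1.0 / (2 * n_H * n_W * n_C) ** 2
    return scale * np.sum((gram(a_S) - gram(a_G)) ** 2)

# Illustrative usage with random stand-ins for the layer-l activations
# of the style image S and the generated image G.
rng = np.random.default_rng(0)
a_S = rng.random((4, 4, 5))
a_G = rng.random((4, 4, 5))
print(layer_style_cost(a_S, a_G))
```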
So just to finish up, this is the style cost function defined using layer l. As you saw on the previous slide, it's basically the squared Frobenius norm of the difference between the two style matrices, computed on the image S and on the image G, divided by this normalization constant, which isn't that important.

And, finally, it turns out that you get more visually pleasing results if you use the style cost function from multiple different layers. So, the overall style cost function you can define as a sum over all the different layers of the style cost function for that layer, weighted by some set of additional hyperparameters, which we'll denote as $\lambda^{[l]}$ here:

$$J_{\text{style}}(S, G) = \sum_{l} \lambda^{[l]} J^{[l]}_{\text{style}}(S, G)$$

What this does is allow you to use different layers in the neural network: the early ones, which measure relatively simple low-level features like edges, as well as some later layers, which measure high-level features. It causes the neural network to take both low-level and high-level correlations into account when computing style. In the programming exercise, you'll gain more intuition about what might be reasonable choices for these hyperparameters λ as well.

And so, just to wrap this up, you can now define the overall cost function as

$$J(G) = \alpha \, J_{\text{content}}(C, G) + \beta \, J_{\text{style}}(S, G)$$

and then use gradient descent, or a more sophisticated optimization algorithm if you want, to try to find an image G that minimizes this cost function J(G). If you do that, you'll be able to generate some pretty nice novel artwork.

So that's it for neural style transfer, and I hope you have fun implementing it in this week's programming exercise. Before wrapping up this week, there's just one last thing I want to share with you, which is how to do convolutions over 1D or 3D data, rather than over only 2D images. Let's go into the last video.
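As a closing sketch before that last video, here is how the costs from this video might be assembled end to end. The λ weights, the α and β values, the placeholder content cost, and the random stand-in activations are all illustrative assumptions, not values from the lecture:

```python
import numpy as np

def gram(a):
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)
    return flat.T @ flat

def layer_style_cost(a_S, a_G):
    n_H, n_W, n_C = a_S.shape
    return np.sum((gram(a_S) - gram(a_G)) ** 2) / (2 * n_H * n_W * n_C) ** 2

def style_cost(acts_S, acts_G, lambdas):
    """J_style(S, G) = sum over chosen layers l of lambda^[l] * J_style^[l](S, G)."""
    return sum(lam * layer_style_cost(a_S, a_G)
               for lam, a_S, a_G in zip(lambdas, acts_S, acts_G))

def total_cost(J_content, J_style, alpha, beta):
    """J(G) = alpha * J_content(C, G) + beta * J_style(S, G)."""
    return alpha * J_content + beta * J_style

# Illustrative usage: two layers with equal lambda weights and random
# activations standing in for the convnet features of S and G.
rng = np.random.default_rng(0)
acts_S = [rng.random((4, 4, 5)), rng.random((2, 2, 8))]
acts_G = [rng.random((4, 4, 5)), rng.random((2, 2, 8))]
J_s = style_cost(acts_S, acts_G, lambdas=[0.5, 0.5])

# J_content=1.0 is a placeholder; in practice it comes from the content
# cost defined in the previous video. alpha and beta are arbitrary here.
print(total_cost(J_content=1.0, J_style=J_s, alpha=10.0, beta=40.0))
```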