For example, you could try to classify images for your usual MNIST problem, except that this time you are not that interested in classifying the digits in MNIST: you want to classify digits on house numbers instead. The problem with house numbers is that, say, you have a labeled MNIST dataset, but no one has given you labeled house numbers yet; let's just pretend those labels don't exist.

Of course this is a toy problem, but the same thing arises when, for example, you have an image classifier and you want to apply it to classify images from your social network. So you move to a slightly different set of photo cameras, a different set of brands, maybe just different content. And you want your network to be as good on this changed domain, on this new dataset, as it was on the originally labeled one.

Now, of course, you could just stop training earlier and somehow validate, but let's see how we can improve over this classical approach with an adversarial component.

Your original task is handled by a classifier or regressor like this one. Let's split this model into two parts: the first part, the left one, tries to extract features, and the second part uses them to predict something. This division is, however, rather arbitrary; there is no specific boundary to split the model at, you can pick any one. The whole model is usually trained via backpropagation. Now, if you want to prevent this model from overfitting to your particular domain, let's try to apply the adversarial idea to those features.

Here there is this purple network, which is our discriminator; it looks at the intermediate features. It tries to judge how the model sees the world: it tries to distinguish between those features as your model processes the initial training set of images and as it processes the target domain of images. So it basically tries to see whether there is any difference between how your model behaves on training objects and on those out-of-domain, target-domain images, the social network photos, for example.
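To make this concrete, here is a minimal PyTorch sketch of such a split, with a discriminator attached to the intermediate features. All layer sizes here are illustrative assumptions, not the exact architecture from the slides.

```python
import torch
import torch.nn as nn

# "Left part": extracts features from an image.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch, 64)
)

# "Right part": predicts class labels from those features.
label_predictor = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 10),   # e.g. 10 digit classes
)

# The "purple" discriminator: looks at the same intermediate features and
# tries to tell source-domain images from target-domain images.
domain_discriminator = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 1),    # logit for "these features came from the source domain"
)

x = torch.randn(8, 3, 32, 32)             # a dummy batch of images
features = feature_extractor(x)
class_logits = label_predictor(features)
domain_logits = domain_discriminator(features)
```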
Now the question is: let's say your purple network succeeds, and it reaches almost 100% accuracy in telling whether your network is currently processing a training image or a target-domain image. Is that good or bad?

Well, yeah, exactly: if the discriminator can tell, simply by looking at the features, what kind of image it is, it means the representation your neural network has learned is different for training and for target-domain images. If something differs between training and validation data, that is usually a bad sign; in this case it is a really bad sign, it means your model overfits to the source domain.

So, aside from the original loss, this L_classifier here, you also add an adversarial component: a term based on the probability that the features come from the real training set. This basically means that you want to train those features, this left part of your classifier network, so that it is indistinguishable how the model operates on the training data and how it operates on the target domain.

And again, you train those two models simultaneously: you optimize this kind of mixed objective for the classifier. Of course, you can tune it slightly by scaling the adversarial term in the classifier objective by a multiplicative constant, a kind of regularization factor, if you wish.
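Continuing the sketch above, one hedged way to implement this mixed objective is to alternate a discriminator step with a feature-extractor/classifier step. The original recipe is often implemented with a gradient reversal layer instead; the "fool the discriminator" term below is the equivalent adversarial formulation, and `lam` is an assumed value for the scaling constant.

```python
import torch.nn.functional as F

lam = 0.1   # assumed value for the multiplicative "regularization" constant
opt_model = torch.optim.Adam(
    list(feature_extractor.parameters()) + list(label_predictor.parameters()))
opt_disc = torch.optim.Adam(domain_discriminator.parameters())

def training_step(x_src, y_src, x_tgt):
    # 1) Discriminator step: learn to label source features 1, target features 0.
    f_src = feature_extractor(x_src).detach()
    f_tgt = feature_extractor(x_tgt).detach()
    d_loss = (
        F.binary_cross_entropy_with_logits(
            domain_discriminator(f_src), torch.ones(len(x_src), 1))
        + F.binary_cross_entropy_with_logits(
            domain_discriminator(f_tgt), torch.zeros(len(x_tgt), 1)))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # 2) Feature + classifier step: the usual classification loss on labeled
    #    source data, plus an adversarial term that pushes target features to
    #    look like source ones, i.e. makes the two domains indistinguishable.
    cls_loss = F.cross_entropy(label_predictor(feature_extractor(x_src)), y_src)
    adv_loss = F.binary_cross_entropy_with_logits(
        domain_discriminator(feature_extractor(x_tgt)),
        torch.ones(len(x_tgt), 1))
    opt_model.zero_grad()
    (cls_loss + lam * adv_loss).backward()
    opt_model.step()

# Usage with dummy batches: labeled source images, unlabeled target images.
x_src, y_src = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_tgt = torch.randn(8, 3, 32, 32)
training_step(x_src, y_src, x_tgt)
```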
And this way, you can obtain a model that adapts toward a domain for which you do not even need any labels. It does not need labeled social network images, or labeled house numbers; what it needs is a labeled source dataset (labeled ImageNet or MNIST in our examples) plus unlabeled data from your target domain. So, basically, this is a very powerful idea.

And since I am promoting the idea that deep learning is a kind of language you speak to your machine learning model to describe what you actually want it to learn, this adversarial approach gives you a word of power, which is "indistinguishable". If you want some kind of behavior to be indistinguishable between one case and another, you can train a discriminator and optimize against it in an adversarial manner.

So this was the method of adversarial domain adaptation, but just to unwind, let's cover some of the fancier applications here.

You have probably all seen the cool artificial intelligence apps like Prisma or Artisto; Prisma is probably the overwhelming favorite here. The idea is that those apps morph your image in a way that follows the artistic style of a particular painting, or maybe a particular style of art, like impressionism, for example. And so far, you do this magically: you insert your image into a super-mega image box and wait for a minute. Let's cover the math, the nuts and bolts of how you actually do that.

Again, you have to make some representation of the model indistinguishable here. You will not need a special trainable discriminator this time, but you do want to somehow obtain an image representation that preserves only the style information. So you want to define the style of an image in a way that the representation you get covers style but not content. Basically, you have to mimic the style, but you have to preserve the content of the image: if you have a selfie, you still want to see your face on it, but the style, the texture, should be like Monet, or something similar.

This is, again, not quite a mathematical problem; it is a heuristical one, if you wish. You could try to define this art style by taking a pre-trained network, one of the ImageNet models, for example, and taking some kind of representation from this network that only captures local information. Of course, you will not be able to just take it as is: you will have to compute something that preserves the low-level texture information but throws away all the higher-order features, like what is actually on the image. Can you think of such a transformation?

There is, of course, more than one way to do that, and it is likely that at least some of you managed to land on something greater than the idea we are going to cover right now. But what you could do, at least, is take filters from some lower layers of the pre-trained network: layers that are not too deep, shallow enough that the filters only catch texture and very small image details. Then you can either average over the whole image, like global average pooling, or compute the Gram matrix over this two-dimensional activation map.

The intuition behind Gram matrices, if you do not know the math, can be explained the following way: you compute how frequently texture features coexist, coincide, at adjacent locations. You compute this for all pairs of features, and you use the resulting matrix as a style descriptor, as a representation of an art style.
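Here is a minimal sketch of that descriptor in code; the layer choice and the shapes are assumptions for illustration.

```python
import torch

def gram_matrix(activations: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (batch, channels, height, width) activation map.

    Entry (i, j) measures how strongly filters i and j fire together
    across spatial locations: texture statistics with the "what is
    where" information averaged out.
    """
    b, c, h, w = activations.shape
    feats = activations.reshape(b, c, h * w)      # flatten the spatial dims
    gram = feats @ feats.transpose(1, 2)          # (b, c, c) co-activation matrix
    return gram / (c * h * w)                     # normalize by map size

# Usage: activations would come from a shallow conv layer of a pretrained
# network; a random tensor stands in for them here.
acts = torch.randn(1, 64, 128, 128)
style_descriptor = gram_matrix(acts)              # shape (1, 64, 64)
```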
Now you could compute this style descriptor for your reference image, say The Starry Night or some Monet painting, and then compute the same descriptor for your selfie. Those descriptors are going to be pretty different, because your selfie obviously is not a painting; it was not painted with a brush. But the idea is that when those two representations, those two descriptors, differ, you can compute some difference between them, say a squared error.

And this whole procedure is going to be differentiable, which is the really important part here, because, remember, we take filters from a differentiable neural network. Then we compute the Gram matrix, or we just average over the whole field, which is simpler but yields less impressive results. Either way, we compute the descriptors and then the difference between the two Gram matrices, and this is, in fact, just a set of multiplications, additions, and maybe some nonlinearities along the way.

Now, if you then adjust your image, if you take your selfie and tweak it to make its style descriptor, this Gram matrix, similar to the one of your reference image, say The Starry Night, then your selfie will slowly take on features of this painting, but not its content.

Since we are only optimizing the texture so far, the result is going to be quite inferior, because the image may even lose its own content as it tries to optimize textures. So let's also add a content term: we want an image which looks like The Starry Night, or any other painting you like, in terms of texture, but which also looks like your selfie in terms of content. Now, where do you get content, and how do you separate it from everything else? If you want higher-level features, you can just go deeper: take maybe a pre-final dense layer, or some of the top convolutional layers, depending on the architecture.

And again, you can just compute a difference there too; it is going to be perfectly differentiable. Then you weigh the terms by multiplying each of those differences by some coefficient, and you minimize the sum over the pixels of the image. You are going to start with a random image or with your selfie (a random image is arguably slightly better), and then you just morph it by following the gradient direction, or with any other optimization method, following the gradient of this texture dissimilarity plus content dissimilarity.
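Putting the pieces together, here is a hedged end-to-end sketch of this optimization using torchvision's pretrained VGG19 (assuming a recent torchvision and reusing the `gram_matrix` helper from the sketch above). The layer indices, step count, and loss weights are illustrative assumptions, not the exact recipe from the lecture.

```python
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)   # the network is fixed; only the image changes

STYLE_LAYERS = [0, 5, 10]     # shallow conv layers: texture statistics
CONTENT_LAYER = 21            # a deeper conv layer: content

def extract(img):
    """Collect Gram matrices at shallow layers and raw activations deeper in."""
    style_grams, content_act, h = [], None, img
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in STYLE_LAYERS:
            style_grams.append(gram_matrix(h))
        if i == CONTENT_LAYER:
            content_act = h
            break
    return style_grams, content_act

# Random tensors stand in for the painting and the selfie; a real run would
# load images and normalize them with the ImageNet mean and std.
style_img = torch.rand(1, 3, 256, 256)
content_img = torch.rand(1, 3, 256, 256)
target_style, _ = extract(style_img)
_, target_content = extract(content_img)

img = content_img.clone().requires_grad_(True)   # or start from random noise
opt = torch.optim.Adam([img], lr=0.05)
style_weight, content_weight = 1e4, 1.0          # assumed mixing coefficients

for step in range(200):
    style_grams, content_act = extract(img)
    style_loss = sum(F.mse_loss(g, t) for g, t in zip(style_grams, target_style))
    content_loss = F.mse_loss(content_act, target_content)
    loss = style_weight * style_loss + content_weight * content_loss
    opt.zero_grad()
    loss.backward()   # gradients flow all the way back to the pixels
    opt.step()
```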
This builds you an image which inherits the content from the selfie and the texture from the painting. Now, here is an example of how this thing actually works: this photo was morphed to resemble Van Gogh's style of painting. This is, of course, a slightly modified, somewhat hacky version of the algorithm, so it is not just the activations of one layer for textures and another layer for content. We will include a more complete description in the reading section: which layers to use, which networks, and what optimization methods to apply to get faster results. We encourage you to follow the URL there and try this yourself, if you have not done so already.

See you in the next section.