[MUSIC] For once in this entire course, I actually get to explain some pieces that I really like. Let's talk about generative models. Now, there is a whole bunch of generic, unsupervised generative models. Much like auto-encoders, they try to solve all possible problems equally badly. Generative models such as generative adversarial networks, on the other hand, try to generate images specifically, and since they specialize, they get to do this one thing really well.

Now, the problem of generating stuff is a broad one. You could generate images, music, voice, or any abstract sound recording. You could try to generate abstract measurements of data from your favorite machine learning competition, or even complicated events from the Large Hadron Collider. But we'll study generative models through image generation, and the reason is kind of simple. When there is something wrong with a generative model, it's usually hard to tell, and there's no obvious way to judge whether one model is better or worse than another. Images are an exception: it's usually pretty easy to tell that a face generated by a model is wrong if it has, say, three eyes. So this gives you an intuitive advantage in understanding the models.

So, let's find out how to generate stuff. The pipeline here is roughly the reverse of what you usually have when classifying. In a classification problem, you have an image, and you reduce it down with convolution, pooling, convolution, pooling, you know the drill. Then you use some dense layers, or any other deep learning architecture you like, to predict the final label, so your output is not an image but basically a vector of numbers, usually a small one. Today's pipeline is going to be the reverse, but don't get scared by this just yet. The previous slide showed you how to classify a chair into a particular type and orientation. Now we're going to take that type and orientation, maybe some random noise, or maybe some other kind of description of a chair, and we'll try to convert this description back into the original image, into pixels.
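To make this reverse pipeline concrete, here is a minimal sketch in PyTorch (my choice of framework; the lecture doesn't prescribe this exact code, and all layer sizes are illustrative assumptions). A small description vector, standing in for the chair's type and orientation or just random noise, is upsampled back into pixels with transposed convolutions, and the model is fit with the pixel-wise mean squared error discussed next.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, code_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            # turn the description vector into a tiny spatial feature map
            nn.Unflatten(1, (code_dim, 1, 1)),
            nn.ConvTranspose2d(code_dim, 128, kernel_size=4),                 # 1x1 -> 4x4
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 4x4 -> 8x8
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),   # 8x8 -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),    # 16x16 -> 32x32
            nn.Sigmoid(),                                                     # pixels in [0, 1]
        )

    def forward(self, code):
        return self.net(code)

gen = Generator()
codes = torch.randn(8, 64)                        # 8 "descriptions" (here: just noise)
images = gen(codes)                               # -> (8, 3, 32, 32)
reference = torch.rand(8, 3, 32, 32)              # stand-in for real reference pictures
loss = nn.functional.mse_loss(images, reference)  # the pixel-wise MSE objective
loss.backward()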
So this is how you can solve this problem. You have a description of an object, you try to predict an image, and you train your model via a pixel-wise mean squared error between the predicted pixels and the actual reference picture of a chair or some other object. Now, this kind of works, but the problem is that this generative task cannot be solved exactly: you can probably draw more than one chair that satisfies all those properties. And for more complicated problems it's even worse. If you are tasked with generating, say, a white, male, middle-aged face, there's more than one face that satisfies this description. I'm sorry, I might be slightly biased when it comes to describing the face here. So the problem is that if there is more than one way you can predict something, if there is more than one correct answer, then minimizing the squared error is going to suck. Hard. If there are, say, two possible hair colors, blonde or dark, then minimizing MSE makes you predict the average, a kind of unrealistic hair. If faces may or may not have facial hair, the prediction will have a seemingly transparent facial hair. You'll get an average image that doesn't look like anything real. So pixel-wise mean squared error is bad, like, real bad, and we want to avoid it if we want to generate effectively. Otherwise, we'll only be able to get those average images, like the ones you obtain from auto-encoders.

What we actually want to find is some efficient representation that only has a few properties. It has to capture the higher-order, semantic features: for faces, that might be the presence of facial hair, the orientation of the face, or maybe eye color, but it doesn't have to encode every pixel. And the distance between two such representations, the distance we're going to minimize, shouldn't depend on the pixel-wise position of everything. Instead, it should be a kind of semantic distance.
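Before developing that idea, here is a toy sketch of the averaging problem just described, with made-up numbers: half of the reference pixels are "blonde" (around 0.9), half are "dark" (around 0.1), and the single MSE-optimal prediction settles on a muddy average that matches neither.

import torch

blonde = torch.full((500,), 0.9)     # hypothetical blonde-hair pixel intensity
dark = torch.full((500,), 0.1)       # hypothetical dark-hair pixel intensity
targets = torch.cat([blonde, dark])  # two equally valid "correct answers"

pred = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([pred], lr=0.5)
for _ in range(100):
    opt.zero_grad()
    loss = ((pred - targets) ** 2).mean()  # pixel-wise MSE
    loss.backward()
    opt.step()

print(pred.item())  # ~0.5: the "seemingly transparent" average, not a real hair color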
So what does this semantic distance mean? If two people both have facial hair, the same skin tone, and more or less the same face position, then the error should be small, even if the images are slightly shifted, for example. The good news is that we already have this kind of representation. We were actually obtaining such representations with other kinds of networks previously: they emerge automatically when we solve a classification problem. So what kind of representation is this? Of course, there is more than one correct way to get one, but the popular approach we're going to build on now is to use the feature space that gets learned when you backpropagate, when you train a network on image classification. If you train a classifier on the ImageNet classes, and there's all kinds of stuff in there, the features you end up learning are more or less parts of those objects, or maybe kinds of textures, but they contain all the semantic, high-level information, especially if you go deep into the network, near the output layers. What they do not contain is orientation. Or position. Okay, orientation might be slightly more important here. The trick with position is that it doesn't actually matter where on the image the cat is: if you try to classify it, it's still a cat. Therefore, it's convenient for the network to learn features that don't change much if your cat's position changes slightly, and this is exactly the kind of representation we need. So what you can do is take some intermediate layer, deep enough in your network, use the activations of this layer, and take, say, the squared error between those activations as your target metric. Instead of minimizing the pixel-wise error between your original reference image and your predicted image, you compute those high-level convolutional features for both images and compute the mean squared error between them. This is a very powerful approach: you use a previously trained classifier as a kind of specially trained metric to train a different model.
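Here is a minimal sketch of this feature-space loss, assuming a torchvision VGG16 pretrained on ImageNet as the borrowed classifier (the lecture doesn't name a specific network, and the layer cutoff is an arbitrary choice). The classifier is frozen and only used to embed both images before comparing them.

import torch
import torch.nn.functional as F
from torchvision import models

# take the convolutional part of a pretrained VGG16 up to an intermediate layer
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # the classifier is a fixed, specially trained metric

def perceptual_loss(generated, reference):
    # MSE between deep activations rather than raw pixels: images that agree
    # on semantics (facial hair, pose, skin tone) score close even if shifted
    return F.mse_loss(vgg(generated), vgg(reference))

generated = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in generator output
reference = torch.rand(1, 3, 224, 224)                      # stand-in target image
loss = perceptual_loss(generated, reference)
loss.backward()  # gradients flow back toward the generator
# (a real setup would also normalize inputs with ImageNet statistics)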
And basically, all it takes to employ such an approach is a pre-trained classifier for a reasonably large classification problem adjacent to your generation problem. So if you have, say, a classifier of faces, it makes sense to use it when generating faces, and vice versa. For image generation, this kind of resource is freely available thanks to ImageNet, but in many other domains it's much more scarce. So let's now try to find out how we can train this intermediate network specifically for our problem, instead of just borrowing it from another, similar problem. [MUSIC]