What are deep ConvNets really learning? In this video, I want to share with you some visualizations that will help you hone your intuition about what the deeper layers of a ConvNet are really doing. And this will help us think through how you can implement neural style transfer as well.

Let's start with an example. Let's say you've trained a ConvNet, an AlexNet-like network, and you want to visualize what the hidden units in different layers are computing. Here's what you can do. Let's start with a hidden unit in layer 1. Suppose you scan through your training set and find the images, or the image patches, that maximize that unit's activation. In other words, pass your training set through your neural network, and figure out what image maximizes that particular unit's activation.

Now, notice that a hidden unit in layer 1 will see only a relatively small portion of the input image. So if you plot what activates that unit, it makes sense to plot just small image patches, because that's all of the image that that particular unit sees. So if you pick one hidden unit and find the nine input images that maximize that unit's activation, you might find nine image patches like these. It looks like, in the lower region of an image that this particular hidden unit sees, it's looking for an edge or a line at that angle. So those are the nine image patches that maximally activate one hidden unit's activation.

Now, you can then pick a different hidden unit in layer 1 and do the same thing. So that's a different hidden unit, and this second one, represented by these nine image patches, looks like it's looking for a line in that portion of its input region, which we also call its receptive field. And if you do this for other hidden units, you'll find that other hidden units tend to activate on image patches like these. This one seems to have a preference for a vertical light edge, with a preference that the left side of it be green. This one really prefers orange colors, and this is an interesting image patch: this red and green together make a brownish or brownish-orangish color, but the neuron is still happy to activate on that, and so on.
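As an aside for anyone who wants to try this procedure, here is a minimal sketch in PyTorch, assuming torchvision's pretrained AlexNet and an iterable `dataset` of (image tensor, label) pairs; the layer and unit indices are illustrative choices, not values from the video:

```python
# Sketch of the "maximally activating patches" procedure: run a dataset
# through the network and keep the nine images (and spatial positions)
# that most strongly activate one channel of an early conv layer.
import heapq
import torch
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

activations = {}
def hook(module, inputs, output):
    activations["feat"] = output.detach()

# The first conv layer plays the role of "layer 1" in the video.
model.features[0].register_forward_hook(hook)

unit = 7          # which channel (hidden unit) to inspect; arbitrary choice
top = []          # min-heap of (activation, image_index, row, col)

with torch.no_grad():
    for idx, (img, _) in enumerate(dataset):     # `dataset` is assumed given
        model(img.unsqueeze(0))
        fmap = activations["feat"][0, unit]      # this unit's activation map
        val = fmap.max().item()
        r, c = divmod(fmap.argmax().item(), fmap.shape[1])
        heapq.heappush(top, (val, idx, r, c))
        if len(top) > 9:                         # keep only the top nine
            heapq.heappop(top)

# `top` now identifies the nine maximally activating images, along with
# where in each feature map the peak activation occurred.
```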
So these are nine different representative neurons, and for each of them, the nine image patches that they maximally activate on. This gives you a sense that trained hidden units in layer 1 are often looking for relatively simple features, such as an edge or a particular shade of color. All of the examples I'm using in this video come from the paper by Matthew Zeiler and Rob Fergus, titled "Visualizing and Understanding Convolutional Networks." And I'm just going to use one of the simpler ways to visualize what a hidden unit in a neural network is computing. If you read their paper, they have some other, more sophisticated ways of visualizing what the ConvNet is doing as well.

So now you have repeated this procedure several times for nine hidden units in layer 1. What if you do this for some of the hidden units in the deeper layers of the neural network? What is the neural network learning at the deeper layers? In the deeper layers, a hidden unit will see a larger region of the image, where at the extreme end, each pixel could hypothetically affect the output of these later layers of the neural network. So later units are actually seeing larger image patches, although I'm still going to plot the image patches at the same size on these slides.

But if we repeat this procedure, this is what you had previously for layer 1, and this is a visualization of what maximally activates nine different hidden units in layer 2. So I want to be clear about what this visualization is. These are the nine patches that cause one hidden unit to be highly activated. And then each grouping is a different set of nine image patches that cause one hidden unit to be activated. So this visualization shows nine hidden units in layer 2, and for each of them, shows nine image patches that cause that hidden unit to have a very large output, a very large activation. And you can repeat this for deeper layers as well.

Now, on this slide, I know it's kind of hard to see these tiny little image patches, so let me zoom in on some of them. For layer 1, this is what you saw. So, for example, this is that first unit we saw, which is highly activated if, in its region of the input image, there's an edge at maybe that angle.
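The claim that deeper units see larger regions can be made precise: for a plain stack of conv and pooling layers, the receptive field size grows by (kernel size - 1) times the accumulated stride at each layer. Here is a minimal sketch of that recurrence; the (kernel, stride) pairs are AlexNet's first few layers, used purely for illustration:

```python
# Track receptive field size r and effective stride j layer by layer.
# (kernel, stride) pairs below are AlexNet's early layers, illustrative only.
layers = [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1)]

r, j = 1, 1                      # one pixel, stride one, at the input
for depth, (k, s) in enumerate(layers, start=1):
    r = r + (k - 1) * j          # each layer widens the field by (k-1)*j
    j = j * s                    # and multiplies the effective stride
    print(f"after layer {depth}: receptive field {r}x{r} pixels")
```

Running this, the receptive field grows from 11x11 after the first conv layer to 51x51 a few layers in, which is why the deeper patches in these visualizations cover much more of the image even though they are plotted at the same size.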
Now let's zoom in on layer 2 as well, on that visualization. So this is interesting. Layer 2 looks like it's detecting more complex shapes and patterns. So, for example, this hidden unit looks like it's looking for a vertical texture with lots of vertical lines. This hidden unit looks like it's highly activated when there's a rounder shape in the left part of the image. Here's one that is looking for very thin vertical lines, and so on. So the features the second layer is detecting are getting more complicated.

How about layer 3? Let's zoom into that. In fact, let me zoom in even bigger so you can see this better. These are the things that maximally activate layer 3. But let's zoom in even bigger, and this is pretty interesting again. It looks like there is a hidden unit that seems to respond highly to a rounder shape in the lower left-hand portion of the image, maybe. So that one ends up detecting a lot of cars and dogs, and it looks like it's even starting to detect people. And this one looks like it is detecting certain textures, like honeycomb shapes or square shapes, this irregular texture. For some of these, it's difficult to look at the patches and manually figure out what the unit is detecting, but it is clearly starting to detect more complex patterns.

How about the next layer? Well, here is layer 4, and you'll see that the features or patterns it's detecting are even more complex. It looks like this one has learned almost a dog detector, but all these dogs look quite similar, right? I don't know what dog species or dog breed this is, but all of those are dogs, and they look relatively similar as dogs go. It looks like this hidden unit is detecting water. This one looks like it is detecting the legs of a bird, and so on.

And then layer 5 is detecting even more sophisticated things. So you'll notice there's also a neuron that seems to be a dog detector, but the set of dogs it detects here seems to be more varied. And then this one seems to be detecting keyboards, or things with a keyboard-like texture, although maybe with lots of dots against a background. I think this neuron here may be detecting text; it's always hard to be sure. And then this one here is detecting flowers.
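To actually display patch grids like these, each top-scoring image is cropped around the unit's receptive field, mapping the feature-map position back to input pixels. Here is a minimal sketch continuing the two earlier sketches, where `r` and `j` are the receptive field size and effective stride computed above, and `top` holds the (value, index, row, col) records; it ignores padding, so the crop center is approximate, and all names are carried over from those illustrative examples:

```python
# Crop the (approximate) receptive-field patch around a unit's argmax
# position; `r`, `j`, `top`, and `dataset` come from the earlier sketches.
def crop_patch(img, row, col, r, j):
    """Map a feature-map position back to input pixels and crop an r x r patch."""
    _, H, W = img.shape
    cy, cx = row * j + r // 2, col * j + r // 2   # approximate patch center
    y0 = max(0, min(cy - r // 2, H - r))          # clamp crop to image bounds
    x0 = max(0, min(cx - r // 2, W - r))
    return img[:, y0:y0 + r, x0:x0 + r]

# Nine patches, strongest activation first.
patches = [crop_patch(dataset[idx][0], row, col, r, j)
           for _, idx, row, col in sorted(top, reverse=True)]
```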
So we've gone a long way, from detecting relatively simple things such as edges in layer 1 and textures in layer 2, up to detecting very complex objects in the deeper layers. I hope this gives you some better intuition about what the shallow and deeper layers of a neural network are computing. Next, let's use this intuition to start building a neural style transfer algorithm.